The Data Stack Show - 198: Building AI Search and Customer-Enabled Fine-Tuning with Jesse Clark of Marqo.ai

Episode Date: July 17, 2024

Highlights from this week's conversation include:

- Jesse's background and work in data (0:35)
- E-commerce Application for Search (1:23)
- Ph.D. in Physics Experience Then Working in Data (2:27)
- Early Machine Learning Journey (4:35)
- Machine Learning at Stitch Fix (7:28)
- Machine Learning at Amazon (10:39)
- Myths and Realities of AI (13:49)
- Bolt-On AI vs. Native AI (17:26)
- Overview of Marqo (19:46)
- Product launch and fine-tuning models (23:02)
- Importance of data quality (25:38)
- The power of machine learning in search (32:02)
- Future of domain-specific knowledge and product data (34:08)
- Unstructured data and AI (37:19)
- Technical aspects of Marqo's system (39:42)
- Challenges of vector search (43:27)
- Evolution of search technology (48:15)
- Future of search interfaces (50:43)
- Final thoughts and takeaways (51:53)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. All right, welcome back to the Data Stack Show. We are here with Jesse Clark from Marqo. Jesse, welcome to the Data Stack Show. We are super thrilled to talk
Starting point is 00:00:38 with you today. Yeah, great to be here. Thanks so much for having me on. All right, you have a really long, interesting history, which we're going to dig into in a minute. But give us the abbreviated version. Abbreviated version. Yeah. Started out physics, PhD, looking at very small things. Spent about six years in academia, then decided that this wasn't for me. Moved into women's fashion, doing data science at Stitch Fix.
Starting point is 00:01:01 Segwayed into Alexa at Amazon, robotics and search, and then founding Marco, which brings me here today. Awesome. Time to discuss. All right. Yeah. Welcome on, Jesse. So we talked a little bit before the show about the e-commerce application for search, some on the history of search, and why it's so complicated and messy right now. So what are some topics you're interested in covering? Yeah, really excited to talk about machine learning and search, vector search, some of the new capabilities that really unlocks and really forward-looking as well,
Starting point is 00:01:33 because I think that we're going to see a large evolution in the way that we search. We've already seen some of that with things like chat GPT and these kind of question and answering methods. So, you know, forward-looking, I think it's really exciting. Yeah, awesome. All right, shall we do it? Let's do it. Jesse, so glad to have you on the show. And I have to entertain my curiosity just a little bit before we dive into data stuff. To ask you about physics, you have a PhD in physics.
Starting point is 00:02:01 You studied very microscopic things. If my very limited understanding is correct, you sort of almost replicated like microscopes that are too small to have a lens for or something, which is insane. So I just have to know, like, having a PhD in physics, what was the most surprising thing you discovered or sort of the most unexpected thing in you know learning so much about physics yeah that's a great question and yeah really well understood in terms of what i was doing from that very brief chat in terms of i think what was the most surprising thing with physics i did a lot of experimental physics was just how good you had to become in these adjacent areas so things like you know electronics plumbing we used areas. So things like, you know, electronics, plumbing,
Starting point is 00:02:49 we used to do a lot of these, you know, we lived and died by experimental data and we'd go to great lengths to collect this data. And so we'd be living, you know, in a lead hutch for six days, you know, collecting all this data. And so we'd have to, you know, program robots, you know, to collect the data, we'd have to hook up vacuum pumps, you know, do the electrical, you know, ourselves, but none of this was taught to you. So had to just work it out and this was you know a while ago now there was much less resources on the internet and so it was really just like you know having to there's no choice but to work it out on the spot yeah that's not theoretical physics that's not theoretical yeah i mean so my conclusion from this is like if I'm going on an adventure and I need a real renaissance man who can, you know, figure out how to hook up a vacuum, I just need to find a doctorate in physics, you know, to do this thing. Where's somebody you want for your team on Survivor?
Starting point is 00:03:36 That's true. Awesome. So we have a ton to cover today, talking about AI, talking about search, all the challenges of the search, talking about how those two worlds are colliding. But what I'd love to do, you have been in the depths of machine learning for years now. And so you're absolutely one of those people who I consider, you were doing machine learning and, you know, quote unquote AI. We can talk about that, you know, how that term we want, but you were doing that way before it was cool. So early on at Stitch Fix, you were doing machine learning across a number of disciplines from, of course, recommendations, but also, you know, demand forecasts, etc. Can you just give me, well, actually, why don't we just start there? Can you
Starting point is 00:04:26 just give your machine learning journey? So after you did your PhD, what types of machine learning stuff did you work on and where? Yeah, I think it was quite organic. You know, when you did sort of the PhD, you know, a lot of the, you know, a lot of the things that you have to do is you need to, you know, solve a lot of problems, you know, analyze a lot of data and, you know a lot of the you know a lot of the things you have to do is you need to you know solve a lot of problems you know you analyze a lot of data and you know that comes out you've got to write the programs and so it starts to become this sort of very natural evolution you've got to write these programs for analysis you've got to write these algorithms for analysis and so you sort of automatically start you know basically developing your own kind of machine learning libraries as part of your phd and so you know in physics there was a huge amount of talk
Starting point is 00:05:04 back in the 2000s about big data. You know, we had huge amounts of data at the time. It was petabytes of data. We used to carry suitcases of hard drives back from these experiments. We had so much data, we just didn't know how to analyze it, really a lot of it. And there was, you know, everyone talked about this future state where we could get all this information from the data.
Starting point is 00:05:20 But at the time, we were like, I have no idea really how we're going to achieve this. Of course, now, you know, looking back, we have so many tools now and big data really is something that could be leveraged and so that's amazing to see that you know it took a long time took longer than i think everyone expected but you know sort of almost 20 years ago it was such a hot topic and now 20 years later we do have a lot of the tools so yeah the machine learning kind of happened organically i didn't realize at the time it was even probably called machine learning you You know, I just thought these were algorithms that we had to do. And then it wasn't until I started to look outwards, you know, from my own discipline that I realized, actually, these are very similar things, you know, they're applied in many different ways. And they're really,
Starting point is 00:05:56 you know, really valuable to a lot of other functions, which are, you know, out of sort of core, you know, science and physics. At what point do you remember the moment when you realized okay i'd build these algorithms to you know operate on these petabytes of you know experiment data what was the point in which you realized like okay well i actually have a machine learning skill here that i could take to industry you know outside of academia do you remember that moment yeah i think it was i think it was slightly different i didn't think that i had the skill i was more like i suddenly realized i lacked a lot of skill and so i was you know i was in my you know physics domain and then
Starting point is 00:06:34 started to look you know outside of it and really think about you know oh yeah exactly i was like i learned all this stuff i must be able to apply it i'm you know i've got to be pretty you know pretty good at this straight away and then started to look at it and realized, oh, hang on a minute. I actually know nothing here. I need to learn a lot more. This was probably, this was quite a long time ago. So it was very early on in the machine learning journey. But I think it was that realization that there was this huge amount that I still need to learn, which motivated me a lot to actually sort of really cover those gaps. Of course, a lot of the stuff that you do learn in fields like physics actually does have a lot of counterparts in other fields like machine learning they're called the terminology is very different so again once you sort of
Starting point is 00:07:12 realize that then you actually you know again actually i know more than i thought i did you realize what the mappings are between these subjects yeah makes total sense so with stitch fix your first sort of in industry job and machine learning after after you left academia yeah exactly so yeah moving into yes stitch fix was the first sort of yeah industry job you know full-time uh data science and machine learning and so that was you know really exciting i mean a lot of the the one thing that was so amazing as well was just how you know in physics and particularly in experimental physics, you know, the quality of the data and then the quality of the analysis really dictated, you know, the outcomes. And that was exactly the same straight away.
Starting point is 00:07:52 You know, I could recognize that that was the same in industry, that it was, you know, the same kind of primitives really thinking about, you know, the data and being really careful with that and thinking about how to actually, you know, drive, you know, sort of, you know, outcomes. Yeah. So rumor has it, and I don't remember what year it was, but Stitch Fix got so good at some of their recommendation algorithms that it was kind of that little rivaled like Netflix level. I don't remember the like what year it was,
Starting point is 00:08:18 but at least on the data community, it was like people were super impressed with the signal that you guys were able to extract like at what point were you you know you there and like how did that come about like what do you think some of the keys to the success of data science because most companies go the data science route hire data scientists and like it just doesn't pan out how they want and it seems like savage fix was very much the exception to that yeah it was very interesting i think they worked very hard to sort of build an environment that allowed people to kind of do,
Starting point is 00:08:49 have quite a bit of freedom in terms of exploration, but then once something, you know, hit a rich vein, then it's this sort of exploitation and putting it into business. But I think one of the secrets was honestly like, you know, a bunch of reformed physicists working on these data problems. Yeah. There was a lot of, I mean, it was an incredible kind of mix of people. There was, you know, a lot of PhD physicists, you know there was that yeah there was a lot of it was i mean it was an incredible kind of mix of people there was you know a lot of phd physicists you know computer scientists neuroscientists you know social scientists so a lot of people who had you know
Starting point is 00:09:12 deep expertise and a lot of experimentation and data and so i think that was really the key was that people you know were fanatical about this stuff you know you just didn't leave anything to chance it was you know wake up in you know the middle of the night and think that oh i haven't you know i've missed this piece of thing in my data code in my etl i need to fix that otherwise my downstream is going to be cooked and so i think it was really that you know combination of just getting people who were you know really love data and then giving them the freedom to kind of execute yeah yeah and it sounds like a diversity of backgrounds was helpful there too. I mean, even from your very practical hands-on experience
Starting point is 00:09:47 with data and physics, it's very different than hiring kind of a team of PhD data scientists that all kind of have a uniform background, right? That I find sometimes doesn't get fully translated into actionable
Starting point is 00:10:03 intelligence. So yeah yeah that's cool yeah yeah i think it's i think it's actually one thing that was really noticeable as well as yeah the diversity of backgrounds because again these sort of problems that crop up people have seen and it might look slightly different in their field and then they can bring in different lanes they've got different tools and so you get this sort of you know much there's really good sort of better together story where people are able to bring in a lot of these other ideas and solve these problems so you moved from stitch fix to amazon and you did a couple things at amazon so give us the the overview of you know what types of machine learning problems did you solve at amazon yeah it was really exciting i
Starting point is 00:10:41 joined amazon i didn't even know what i was going to be doing there. So I was, and it was a little bit when I probably tried a little more color. I was living in California at the time. My wife, we just had given birth to twins as well. And so then decided to, you know, take this job with Amazon,
Starting point is 00:10:56 move to Seattle and not actually, I wasn't, I didn't even know what I was going to be doing there. So it was a huge leap of faith. Really, you know, thank my wife for the support there and so yeah went and worked at Amazon on a sort of top secret project basically was the sort of you know number three or four
Starting point is 00:11:12 hire in the team and so this was really ambitious zero to one kind of project so this was really exciting and I think you know Amazon has a reputation you know for these projects and it's certainly true they're really taking you know a big bet and trying to make you know, for these projects. And it's certainly true. They're really taking, you know, a big bet and trying to make, you know, really remarkable things happen. So that was, you know, really just great to see how that sort of evolved and, you know, being connected, although deeply technical and working on the machine, we're in there for this initial project.
Starting point is 00:11:36 It was still very connected to the end, you know, what customer problem we're solving and how do we take, you know, this technology, which is quite complex, it's a bit nascent, still has a number of rough edges, but how do we make that into something that customers are going to love and buy? And so that was really interesting, not just from the technical perspective,
Starting point is 00:11:51 but this sort of holistic product development perspective and just seeing that iterative cycle. And then, yeah, moving on, you know, we lived on to robotics after that. Again, just saw, you know, a huge opportunity in terms of, you know, Greenfield projects taking on something ambitious. Again, you know, the fulfillment center is obviously a huge part of Amazon. And the efficiency there, you know, they've been able to, you know, really drive down, you know, it's quite remarkable. And so, again, to be able to, you know, have a potential impact there was really, really interesting.
Starting point is 00:12:19 So, yeah, developing a lot of the intelligence for robots, you know, all the sort of machine learning models to help them see basically and understand the world. And then, you know, after that, it's been about two and a half years doing that. Also spent some time in sort of retail and shopping. So, again, you know, starting to think about, you know, how do we improve, you know, these experiences online? So there's a whole bunch of different sort of aspects that this touched, you know, sort of how do you help sellers? How do you help people discover things? yeah this was yeah super interesting well i want to zoom out a little bit because that is such a wealth of experience across a variety of machine learning disciplines you know sort of from making a recommendation and, you know,
Starting point is 00:13:10 making a clothing recommendation, stitch fix and fashion to robots, which is crazy and a non-tribial, you know, sort of scale of Reno RoboScience, obviously. So can we talk about the AI landscape for a little bit? And maybe a good question to start with would be what are sort of of the big myths or like if you could sort of dispel, you know, it's like there's so much hype out there. If you sort of had a couple of top things where you're like, you know,
Starting point is 00:13:34 these are the things that really bother me when I see these headlines. What are those things for you? Because you are actually, you know, truly experienced in this and then literally building sort of like AI focused technology, which we'll talk about in a minute. Yeah, great question.
Starting point is 00:13:50 There's a lot, I think, that probably gets me a bit riled up at the moment. I think some of the things in AI, I mean, a lot of it really is that it's, you know, it's not magic. And, you know, there's really no sort of, you know, silver bullet here. These problems, you know, to solve them with AI still requires a lot of the same things that have always been required. You need really good data. You need really disciplined approaches. You need really good evaluations. You need to understand what's happening.
Starting point is 00:14:15 So I think just being grounded that this is still a tool. AI is a tool. It's a technology that you can use. And so all of the usual rules apply. I think in terms of the things that are you know sort of a bit misrepresented you know i think we saw a lot particularly it's the hype started down a little bit but particularly when a lot of the you know large language models started to come out with these incredible skills you know people started to talk about all these emergent capabilities and whatnot you know but i think now it's actually become you know evident that actually these emerging capabilities aren't so emergent it's really that you know people just
Starting point is 00:14:47 didn't stare at the data long enough and actually a lot of this stuff was already in the data and that this is entirely expected and so you know training these models is not again it's not magical you don't just get you know suddenly get this agi you know with our current sort of you know crop of llms you know it's really you know, what's in the data is very much like training, like training or something. You train a muscle, it gets bigger. You know, you train an LLM on this data, it gets better at it. And that's really it.
Starting point is 00:15:11 And I think that, you know, we need to sort of just be really sort of grounded in, you know, what, you know, what we can do and what we want them to do. Yeah, yeah. That makes total sense. Well, help us understand. So if you have to break sort of like, let's say, modern AI down into sort of its part components from a technical perspective, could you break those down for
Starting point is 00:15:33 us? So like, you know, you talked about large language models, there's vector databases, I mean, you have, you know, RAG applications, you have, you know, sort of, but what are sort of the main components when you think about a modern ai application what are the core pieces of that yeah another good question i think the core pieces yeah i mean ai is you know i mean as the name suggests artificial intelligence and it's sort of much more encompassing than something like machine learning which is you know much more sort of specific and so i think ai you know definitely is much more than you know
Starting point is 00:16:04 just an individual component in terms of like a model. It's really worth a lot more than the sum of its parts. And that's what these systems are actually able to sort of do. And so I think from AI, you know, some system perspective in the different parts, you know, there's obviously this, you know, at least in the modern AI, we've sort of centralized around deep learning models or some other machine learning model, which is kind of driving it. But then you have all of this, know peripheral you know components around that and so exactly like you said you know being able to store and retrieve data you know vector databases vector search as we've seen you know augmenting large language models with the ability to retrieve information this has evolved now into say tool use and so now you can actually you know not just
Starting point is 00:16:42 have you know a single you know database for example but you can actually have other functions that can get called out so maybe it needs to request something you know whatnot so you've got these sort of systems now which have you know integrating a whole bunch of other sort of you know data management systems and you know again like serving inference you know but a lot of it actually looks very similar to i think what's happened in software engineering before again you know this isn't you know and you know, you know, although it's, you know, super impressive and really powerful, you know, a lot of the same engineering practices still apply. And so you've still got all these other components, you've got the serving, you've got the interface, you've got the, you know, obviously a lot of the, you know, the sanity checking and the sort of safety around it,
Starting point is 00:17:18 particularly with language models we've seen, you know, with, you know, about the prompt engineering, the prompt injection. Yeah. so I think that's probably it. Well, how would you describe, because I've got two mental models here, especially with SaaS apps. I've got one which I'll call bolt-on AI, which is I have a product and then I call
Starting point is 00:17:37 the chat GPT endpoint and like, you know, return something where I think of these things where they just kind of add an AI. I think there's some problem augmenting. It's kind of a little problematic. They make your problem better. Yeah, where it writes. But in essence,
Starting point is 00:17:56 they're just kind of white-winging what ChatGPT can do and then focusing toward their product, which is fine. Versus what I think you guys are doing, where it's truly a native AI. And I guess, one, how do you communicate that to people? And two, maybe what are some of the challenges with so much noise in the space and so much,
Starting point is 00:18:13 yeah, I was probably the best way to say it, just a very noisy, loud space. How do you communicate that to people? Yeah, I think you're absolutely right. And there's sort of this, you know, consuming of AI, you know, to build something and then sort of producing AI as well. And so there's, you know, a big distinction here. Previously, it was much more about, you know, producing, you know,
Starting point is 00:18:31 the AI and less, you know, there's much less capabilities in terms of being able to consume it and use it. And we've seen, you know, now people are able to integrate it just through a single API and now we're sort of people are AI enabled. But I think, you know, and certainly now the term has become so overloaded, it's very hard to actually distinguish, you know, what's fact from fiction. And I think we'll sort of move, you know, at some point we'll move very quickly, I think, away from people sort of, you know, talking about AI as powering their application. You know, I think probably when electricity came
Starting point is 00:18:56 about, people were like, you know, electrical powered, you know, instead of cooking or something like that. It was a big, you know, selling point. But, you know, now if you sort of said that for a lot of stuff, people would probably look at me quite strangely. Of course, you know, they'd be like, why aren't you using, you know, electricity to do this stuff? Like it would be odd if you weren't. And so, you know, I think we'll get to the point where it becomes so pervasive that AI is just expected
Starting point is 00:19:17 to be part of it. But I think in terms of, you know, being able to sort of, I guess, talk about it, I think we focus a lot on, I think, you know, outcomes, I guess, as well, I guess, talk about it. I think we focus a lot on, I think, you know, outcomes, I guess, as well, and just sort of business value. And so somewhat, you know, sometimes the sort of, you know, in the middle is actually not, you know, less important as to what the actual business value is
Starting point is 00:19:35 and what the sort of outcomes you can derive are. And so I think just trying to focus on, you know, what is it less about, like, you know, the technology and the sort of solution, but really about, like, the problem and actually solving that? Maybe it'd be good for you to just give us an overview of what Marco does before we dig into search, because you mentioned a couple of things when you were giving us the overview of the landscape.
Starting point is 00:19:58 There's vector search, there's vector databases. When you look at Marco, it has, you know, I think several components there, but just level set us on Marco. And then I'd love, because you have tried everything in search, John. And so I'd love for y'all to talk about, you know, why it's a hard problem in here, the history, but yeah, start out with just telling us what Marco does and sort of, and I guess maybe on the vector side, that would be most interesting because when you hear vector search, you kind of instantly think vector database, there's a bunch of vector databases out there. Like not all of them are created equally, obviously.
Starting point is 00:20:34 So help us place Marco correctly in the way that we're thinking about it. Yeah, absolutely. So Marco is one way to think of it as a vector search sort of platform. And so the reason that we're calling it that and working towards this is that vector search itself requires much more than just a vector database. And so vector databases are built around similarity search. So you put in a vector and then you can find the nearest vectors and that's effectively your search, that return lift.
Starting point is 00:21:03 These things are the relevant things in terms of a vector sort of search process however this is still very primitive operation and actually building any kind of search system requires you know much more components from you know than that and in fact vector search itself requires you know a whole bunch of additional machine learning components and so you know moving what with marco we're moving beyond just you know focusing on this similarity piece and actually think more holistically, how do we actually bring this vector search technology to developers so that they can actually integrate it? And so with the current sort of wave of solutions,
Starting point is 00:21:34 the vector database, and then everything is left as an exercise to the engineer, where they've got to now implement all of the orchestration, the abstraction layer that handles the machine learning. And then once they've got that, they've got, you know, now you've got your sort of hello world example. But then if you actually have any sort of suitably valuable, you know, search, you know, search bar or application,
Starting point is 00:21:54 you actually need to really think about how do you actually tune the search? How do you develop the models? And so this is sort of the third piece. And so what we've done with Marco is really think, okay, how do we actually, you know, if someone's going to actually put put this in production have a long-lived service that drives a lot of business value they're going to have to you know cover off on all of these components and these are you know quite different technical you know technical domains require a lot of expertise
Starting point is 00:22:16 and so what we're doing is we have the vector database we have the abstraction layer the orchestration machine learning the inference so people can get started straight away it's documents in documents out so you can search with text, you can search with images. And then we're now building into this place where you can actually start to fine-tune models and actually integrate behavioral data, feedback from users and have this continual system that continues to learn.
Starting point is 00:22:41 And so you've got this search system, which is really performant, covers off all of the components and just continues, you know, gets better over time. And is that last part, because you just had a big product launch. So can you tell us specifically, because it sounds like that last part is sort of,
Starting point is 00:22:54 because I know out of the box, you could plug Marco in and get, you know, you could get, you know, much more relevant search out of the box. But that last part, it sounds like now you're enabling your users to integrate their own sort of first-party data. Is that sort of like bringing around embeddings? Tell us about the launch.
Starting point is 00:23:13 I'd love to hear a little bit more about that. Yeah, really exciting. So like I mentioned, to get vector search to be really valuable long-term, you need a few different components. And so the initial market focused on the vector database and the abstraction layer. And like you said, now we've just launched the ability
Starting point is 00:23:29 for customers to fine-tune their own models on their own data, get really domain-specific models which really understand the nuances. And so I think everyone is familiar. If you search for even sort of basic things like genes on different websites, the notion of what genes would be is very different. And customers have a particular know, very different. And customers, you know, have a particular, you know, flavour, particular style, you know, and so being able to actually capture
Starting point is 00:23:50 a lot of these nuances now is what this new product launches that enables customers to do. So they can actually fine-tune the data, fine-tune the model story on their data to get it to really understand, you know, what their customers and, you know, the language they're using. It allows, you know, comes off on, you know, maybe there's new terms, there's slang terms,
Starting point is 00:24:07 maybe it's multilingual, maybe it incorporates multiple languages. All of this now can actually be learned and then served and integrated into Marco to provide much, much better results. So I'm curious, especially with your background in statistics, I've noticed this trend
Starting point is 00:24:22 where people are more concerned about privacy and tracking, you know, so it's getting harder to collect first-party data. So I've noticed this trend, and I think Stitch Fix did this well early.
Starting point is 00:24:34 These companies are more likely to just ask people. Like quizzes are really popular right now. And these are, yeah, like, which is a funny, like, full circle thing, where we went, like, way off of it: we don't want to have user input, we don't have to do anything, let's just try it, let's, like, see where
Starting point is 00:24:49 they click. Yeah, it's just like digitizing the retail interaction, right, that you'd have in store, right? So I'm interested in, like, maybe some applications there for, like, your technology, where it's like, oh, people are going to ask, like, well, we just run a database. I'm like, well, let's store it in a way that I can interact with it, and then you have that higher quality, more precise data. Do you think that's important for, like, the future of search, especially in, you know, AI? Yeah, I think the data, again, is incredibly important, if not more important now, because, you know, the AI, really, you know, it's trained on a particular set of data, it'll be trained on, you know, a particular sort of style of data. And so what can happen,
Starting point is 00:25:27 this is a long-known problem in machine learning, is you can have these sort of distribution mismatches. So if the machine learning models were trained on a particular type of data and now it starts to see different data, it may behave slightly differently or a bit worse. And so data quality, it's never going to go away. I think it's just, you've got to be fanatical about the data quality,
Starting point is 00:25:46 and that will always pay dividends. And so that really... So you talked about the history, and you looked at multiple solutions. Can you just give us a brief overview? What types of solutions have you tried or purchased? But then I'd also love for you to bring up, what were some of the dreams you had about what you could do with search that were just impossible?
Starting point is 00:26:05 But, like, and then, Jesse, I'd love to hear, like, you know, are those things that Marqo addresses? Yeah. So, I mean, my history with search actually goes way back. There's an open source solution called Apache Solr that, at a previous job, we used to power the search for the app we were building at the time. And I was part of the admin team, so we were doing, like, the mappings and trying to run that and kind of, like, building all the indexes. Yeah, like rebuilding indexes in the middle of the night because things broke. I mean, it was so much work. And then we moved to Elasticsearch, which was kind of the next generation. Like, oh, this is nicer. Like, this is better.
Starting point is 00:26:46 Yeah. And then, you know, Amazon started hosting that for you. And, like, there's kind of this progression. And so we moved out of that and moved into e-comm. And then I come into e-comm and we moved to Shopify. I'm like, okay, let's look at Shopify's built-in search. We ask around, we talk to people, and nobody uses it. And I try it out and was like, okay, all right, I
Starting point is 00:27:05 understand why nobody uses it um but then that discovery process was just so surprising there's some kind of entrenched e-commerce solutions that have been around for a long time that just do e-commerce search and then they're like oh we're gonna add Shopify like because we had you know this is the new thing but they still have that like old model there's the schemas for a lot of them are very rigid of like what we need you to put color here and size here you know whatever your parameters are and I mean we spent thousands of hours cleaning up getting getting data into fixed schemas for search to work, basically. Wow. Did you look at Algolia? We started with one of those
Starting point is 00:27:51 more bespoke vendors. We moved to Algolia, which was better. It was a little bit more flexible. You can feed Algolia actually behavioral type data and do some neat things with them. And then a couple years
Starting point is 00:28:08 into it, they're like, oh, we've got all these new AI features. It's like, oh, this is neat. So we turn them on, and one of them was an AI prediction feature: get insights into what people are typing in where we don't return any results. And I mean, this was...
Starting point is 00:28:24 I think we had over a million visitors a month, so it's not a small traffic site. And, like, the insights were useful, but there wasn't much there. Like, there'd be bad, like, synonym recommendations, like somebody would type one thing and... It was just kind of a disappointing, you know, experience. And I'm sure it's probably developed in the last couple of years. It's developed more. But I think the biggest disappointment
Starting point is 00:28:51 was around discoverability. If I knew a keyword that was in the name of the product, it worked. If I knew the part number, it worked. So what was it? So describe a user problem to me. And then, this is, Jesse, where I'd love for you to, like, weigh in here. Describe your user problem. So, keyword in the name of a product. But a lot of times, especially because you were dealing with, like,
Starting point is 00:29:15 water filtration, a lot of times people are describing their problem, though. Right. Right? I filtered this out of my water. Yeah. Right? Was that like a? Yeah. Describing the problem was one of them. The biggest one in that space was, does this work with this? Oh, interesting. So like relationships.
Starting point is 00:29:37 You're trying to connect this pump to this tube. Yeah. Or like this vacuum pump. Yeah. There you go. But, I mean, that was hard. So we built all these data models, like fits-in or, like, compatible-with,
Starting point is 00:29:53 and it just becomes this web of relationships, and it gets really complicated. I think that was probably one of the most difficult problems: people asking, like, will this work for me? Or, like, you get into materials, or, like, a question like, will this work at this temperature? And it wasn't just specifically listed as, like, a property. Well,
Starting point is 00:30:19 the long tail of, like, indices and, like, contingencies, yeah. But it might have been buried in a description somewhere. At least for us, any of the long descriptions, most of them had pretty valuable information, but they were basically inaccessible from search. All right, Jesse, solve our problem here. Solve our problem. That's right. There's a bit to unpack. I think one, like you described, the first problem was just the difficulty of using even
Starting point is 00:30:48 sort of keyword systems before actually having any machine learning involved. And so certainly one thing that we've focused a lot on at Marqo is being able to make a lot of the vector search technology really accessible to developers. And so taking away a lot of that, you know, just maintenance and that sort of, you know, backend stuff, it's really hard to manage. And so that's part of the, you know, part of the value proposition, is that, you know, we take care of a lot of that. And so that sort of developer experience, so you can get up and going, has been really a core focus. And then, like you say, in terms of, you know, taking on all these different problems, you know, with keyword search, you know, if you know exactly what you want, it's fantastic. It's literally just finding the exact same phrase. But a lot of times,
Starting point is 00:31:28 you don't know what the correct language is. You don't know how to articulate it. Maybe it's a question. Maybe it's buried in the description. And so what we've seen with these machine learning based techniques, particularly around vector search, obviously, is where you can basically define your own relationships in terms of what's similar. And so people start asking these questions or start even querying in some very different way. You can actually start to learn these mappings. And so it's very flexible about what you actually define as being similar. And so with search, someone puts in a query and then they've got back products and these have this natural relationship of similarity. And then you can actually learn all this similarity
Starting point is 00:32:04 as well from these past interactions. And so it similarity. And then you can actually learn all this similarity as well from these kind of past interactions. And so it's so powerful that you can define what is similar. There's no canonical, you know, this is similar to that, but you define that through these relationships. And so that enables you to now ask questions. You know, you can do really anything you want. So it's incredibly powerful.
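To make that concrete, here's a minimal sketch of what vector search does under the hood: every product and every query becomes a vector, and relevance is just similarity between vectors. The hand-made three-dimensional vectors below are illustrative stand-ins for what a real embedding model, such as the fine-tuned ones Jesse describes, would produce.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy "embeddings": in a real system these come from a trained model.
products = {
    "reverse osmosis filter": [0.9, 0.1, 0.2],
    "garden hose": [0.1, 0.9, 0.1],
    "sediment cartridge": [0.8, 0.2, 0.3],
}

def search(query_vec, index, k=2):
    # Rank every product by similarity to the query vector.
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# A query like "remove chlorine from my water" shares no keywords with the
# product titles, but its vector lands near the filtration products.
print(search([0.85, 0.15, 0.25], products))
```

Fine-tuning, in these terms, nudges the embedding model so that the vectors your customers' queries produce land closer to the products they actually engage with.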
Starting point is 00:32:22 Yeah, one of the unlocks for me in doing all this search research is: a recommendation is a search executed on your behalf without input from you. Which seems obvious if you think about it, but recommendations are a search problem. Yeah. Like, it's just not something... I've never framed it that way, but it makes total sense. Yeah.
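That framing, a recommendation as a query-less search, can be sketched directly: one embedding index and one ranking function serve both, and only where the query vector comes from differs. The vectors here are toy stand-ins for learned embeddings.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy item embeddings; a real system would learn these from interactions.
items = {
    "t-shirt": [1.0, 0.1, 0.0],
    "polo":    [0.9, 0.2, 0.1],
    "wrench":  [0.0, 0.1, 1.0],
}

def rank(vec, exclude=()):
    # One ranking function serves both search and recommendations.
    ordered = sorted(items.items(), key=lambda kv: -dot(vec, kv[1]))
    return [name for name, _ in ordered if name not in exclude]

# Search: the query vector comes from embedding the user's text.
print(rank([0.95, 0.15, 0.05]))

# Recommendation: the "query" vector comes from the user's last purchase,
# a search executed on their behalf without any input from them.
print(rank(items["t-shirt"], exclude={"t-shirt"}))
```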
Starting point is 00:32:44 Yeah, yeah, exactly. I think that's, yeah, not many people, I think, have quite realized that, you know, search and recommendations are really, you know, two sides of the same coin. And especially in e-commerce, when you've got, you know, a vague head query, you know, maybe it is just like an item of clothing,
Starting point is 00:32:57 like T-shirt, and there are actually, it's not one result. It's not like information retrieval where you're asking a question, you know, what is the atomic weight of gold, for example, and it's got a very specific answer and you might only have one thing that matches that. You've got this sort of degeneracy where there's a lot of potential matches. And so this is like a recommendations problem.
Starting point is 00:33:12 But then you segue into this, you know, more verbose queries, for example, which may only have one match. And so it's this fluid transition between recommendations and search. And so I think being able to think about these different queries and what they actually require is definitely the right approach. Yeah, I'm really curious, especially for, let's call it, kind of specialty e-com applications where some domain knowledge is required to purchase the right thing. Let's talk car parts.
Starting point is 00:33:44 Maybe it would be a good example. Like how far away do you think companies are from basically combining a knowledgeable model that knows about cars and car parts with their product data that they already have to help people navigate a site? What does that current landscape look like? Or what do you think maybe it looks like in a couple of years from now?
Starting point is 00:34:09 Yeah, it's very interesting. I think, you know, at the moment we've still got quite, you know, sort of, you know, early methods in terms of what we can do here in terms of understanding. And so at the moment, it's still very much, you know, sort of systems with different pieces, and you might have an embedding model that knows particular things, maybe couple that with a language model that knows certain different things. And so that's kind of the sort of current state of play. And, you know, depending on how you want
Starting point is 00:34:32 the results to be sort of displayed, or depending on, you know, what language model you might have on the outside, maybe just, you know, have results. And so if you're asking it to, you know, understand deeply, say if you've got 10 results and you're asking it to distill those results into an answer for a customer, the language model itself at some point has to actually understand a lot about the domain, depending on what it is.
Starting point is 00:34:54 It can't just sort of summarize it. It'll actually have to understand the differences. And so I think what we'll see is probably much more end-to-end training. For example, I think we're already starting to see this, where the embedding models, everything, is really informing each other. These aren't necessarily done in complete isolation
Starting point is 00:35:10 so that you can actually get this sort of system which is domain-specific as well, not just the individual components. Because these things aren't done in isolation, and they also feed into each other. So the results from one thing feed into the other. And if you do have, say, issues in one component, it just flows on through the system. And so being able to really optimize end-to-end, I think, is where we're going to be going. Have these systems,
Starting point is 00:35:33 language models and embedding models that can be optimized together. And then potentially, things like the storage component actually living inside the machine learning models, the large language models. And so going forward, the vector database would be much more tightly integrated with the large language model, for example. Because back to what you were saying earlier, Eric, you basically just want to replicate that highly knowledgeable in-person customer sales rep or whatever, like that person you go and talk to at Home Depot that, like, used to be a plumber for 20 years. They know everything about plumbing. You can describe a solution.
Starting point is 00:36:12 They're like, this is what you need. Like, that's what you want from a search. But we're really far away from that. Yeah, it is interesting. You mentioned car parts. I was thinking, I have a hobby of working on Land Cruisers, which are very popular in Australia. And searching for stuff is phenomenally difficult, because even if you have a base model number, the part interchangeability varies pretty significantly in terms of sets of years.
Starting point is 00:36:38 Right. And so searching for parts is so difficult. And so you end up going through these forums and, like, combing through, like, message threads to be like, is this the right part number for, like, my specific thing? You know, which is wild. I mean, it really shouldn't be that difficult, because I guess the thing that's crazy to me is all of that information exists and actually is, like, pretty available. Like, interestingly enough, it's not like it's a mystery. It's just that as a human, you have to go through
Starting point is 00:37:10 and, like, create these explicit relationships in your own mind, you know, that just haven't been combined from Reddit forums. Yeah, yeah. And I think, yeah, that's what's most exciting about a lot of the current sort of wave in AI and machine learning: now I think we have much better methods to use a lot of this kind of data that exists, which is really relevant, this unstructured data. And so I think previously we had to really anchor it and had to really define these relationships. But I think now we've got this ability to actually, if we know it's relevant, you can sort of start to incorporate that and actually learn from a lot of this other information that exists and actually now incorporate that into, you know, your, say, your search system. So that's really powerful, you know, I think, being able to use a lot of this other data which was previously really hard to use. Yeah. One other thing on this topic is we also had a ton of really useful data locked up in PDFs from the manufacturer, that were manuals,
Starting point is 00:38:06 that were how-to guides. How would AI and AI search potentially unlock some of that data? Yeah, I think there's a few different ways. I think one of the things that we focused on with Marqo was just that problem that you've got so much data which is unstructured and just basically inaccessible. You know, I think the number is something like 80% of the world's data is unstructured data, and it's, you know,
Starting point is 00:38:31 growing at an exponential rate. And so, you know, one of the ways we founded Marqo was to think about, you know, the invariants. There's so much changing in AI, you know, so many new models, everything's changing. So how do we build a business and how do we solve problems which are based on, you know, these invariants? And so we know unstructured data is a huge amount, we know it's going to, you know, keep growing, we know that people need to search it, and we need, you know, relevant results.
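A first practical step toward searching that unstructured data is splitting long documents, say text extracted from a PDF manual, into overlapping passages sized for an embedding model. This chunker is a generic sketch, not Marqo's actual preprocessing:

```python
def chunk(text, max_words=50, overlap=10):
    # Split long unstructured text (e.g. extracted from a PDF manual)
    # into overlapping passages sized for an embedding model.
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        piece = words[start:start + max_words]
        if piece:
            passages.append(" ".join(piece))
        if start + max_words >= len(words):
            break
    return passages

manual = " ".join(f"word{i}" for i in range(120))
# Each passage would then be embedded and indexed individually, so a query
# can land on the one paragraph of the manual that answers it.
print(len(chunk(manual)))  # 3 overlapping passages
```

The overlap is there so a sentence that straddles a chunk boundary still appears whole in at least one passage.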
Starting point is 00:38:52 And so that's sort of one way that we've been, you know, thinking about the problem building Marqo. And so, you know, vector search particularly allows people to search across this unstructured data in ways that previously were impossible. And so now I think as well, not only can you search across it, but, like we sort of just discussed in terms of all this latent data which exists in forums, we now have methods not just to search that, but you can actually incorporate that into a domain-specific model as well and actually really understand that. Okay, Jesse, I want to dig into the technical side a little bit because John mentioned earlier that you can have
Starting point is 00:39:27 a packaged AI application and you send the data, it sends it back, it has its own embedding model, it has its own deep learning model, LLM, whatever. You talked about Marqo as a system, and now
Starting point is 00:39:43 with this latest product launch, which is very exciting, you can bring your own, you know, first-party data to inform the system. But you mentioned something earlier that I think is really interesting, and I think it'd be really helpful, I mean, especially for me, but I'd love for our listeners to, like, walk away with a better understanding of the embeddings model and then the language model, and then the things that you would want to customize on each of those. Or what are the separate concerns across the embeddings model and the language model that you need to think about? And then how does that relationship work in Marqo?
Starting point is 00:40:25 Yeah. So I think one of the things that we've done, particularly with the new product launch, is take quite a holistic approach to the way we build the systems and optimize these models. And so the first consideration, really, on the embedding side, is making sure that we can actually optimize it in a way that's actually going to mirror
Starting point is 00:40:41 what's being used in production. So it's not done in isolation in terms of what data is being used, in terms of how the data is structured. You start to see that people have particular, they might have reviews, they might have titles, they might have descriptions. We know that this is often used. Sometimes it's missing as well.
Starting point is 00:40:55 And so actually you've got to be robust now to missing data. I think that's one of the key things as well. This sort of current paradigm in vector search is, you have one piece of information that's turned into one vector, and you just search over that. But of course, that's pretty naive in terms of what customers and users actually have. They have multiple bits of data; they might have some, they might not have others. And so the first piece is really trying to optimize, you know, I think, the models around what the actual customers need and have, and sort of those use cases, the data structures, how it'll be used in the system. And then also, you know, customising the models as well
Starting point is 00:41:26 in terms of, you know, not just the data structures and how it will be used, but actually what are the outcomes that people want to actually do with this? Like, what are they actually trying to, you know, are they trying to improve a particular aspect of the business? And so being able to use a lot of that sort of domain-specific, you know, data to actually optimise, you know, directly for business outcomes.
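One simple way to get the robustness to missing fields that Jesse mentions is to embed each field separately and combine whatever is present with renormalized weights, so a missing description or review set doesn't drag the document vector toward zero. The field weights and vectors here are illustrative assumptions, not Marqo's actual scheme:

```python
def combine_fields(field_vecs, weights):
    # field_vecs maps field name -> vector, or None when the field is absent.
    present = {f: v for f, v in field_vecs.items() if v is not None}
    total = sum(weights[f] for f in present)  # renormalize over present fields
    combined = [0.0] * len(next(iter(present.values())))
    for f, vec in present.items():
        share = weights[f] / total
        for i, x in enumerate(vec):
            combined[i] += share * x
    return combined

weights = {"title": 0.5, "description": 0.3, "reviews": 0.2}

# Full document: all three fields contribute.
full = combine_fields(
    {"title": [1.0, 0.0], "description": [0.0, 1.0], "reviews": [1.0, 1.0]},
    weights,
)
# Sparse document: only the title exists, and it carries the full weight
# instead of the vector being dragged toward zero.
sparse = combine_fields(
    {"title": [1.0, 0.0], "description": None, "reviews": None},
    weights,
)
print(full, sparse)
```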
Starting point is 00:41:43 And so that's sort of, you know, I think how we're thinking about it quite holistically in terms of from this sort of optimization perspective. So, and this sort of then plays into the different, like you mentioned, the LLMs and the embedding models, each of these sort of different things. So on the embedding side, it really needs to understand, you know, what are the, you know, how does it map,
Starting point is 00:42:00 you know, something that comes in as a query to the information that's in the database, and how do you, you know, create that relationship in a way that, you know, it's going to retrieve the relevant results. And then on the outside, you know, if you've got a large language model, and these are used in many different ways, particularly in search, you know, both on the input and the output. And so on the input side, you know, you can use them for attribute extraction or data enrichment, you know, data cleaning. And then on the output, you can use it to synthesize, you know, a set of results and actually try and reason over it.
Starting point is 00:42:28 on each of these things will depend a little bit on what you need from the language model you know how much domain knowledge does it need or simply need to be able to summarize or do they need to reason about that and so if you start to go beyond you know pretty simple sort of summarization and extraction then these language models themselves will have to become somewhat domain-specific and actually start to understand a lot of the nuance of the field. Makes total sense. Okay, so I have, this may sound like a funny question. Is there a, you know, we talked about a basic search index
Starting point is 00:42:58 where I know what I want, I know the keyword, so I search for the keyword and I get the result. That's great, right? And then we talked about some of these challenges that are much more nuanced, that are very difficult. You have these relationships that the user is not going to make explicit. You know, we need to infer a lot or learn from the inferences that we're uncovering. You know, what are the use cases where maybe vector search is not a good fit? Are there cases where it's like, well, you wouldn't necessarily use it for that?
Starting point is 00:43:28 Yeah, I mean, I think for very explicit sort of, I mean, I think part numbers is a good case there, where you just want the exact match on a part number and nothing else. I mean, I think that's a great example. Don't infer a relationship. Don't infer anything. It's the auto parts desk. Yeah, exactly. Don't make a recommendation. That's right.
Starting point is 00:43:50 Where it's, you know, like a controller, I need to find this thing. But, you know, I think with any of the sorts of shortcomings of vector search at the moment, I think it really comes down to, one, you know, the model just not being appropriate for the particular use case, which is also exactly why we developed this new product, where we can actually fine-tune embeddings on the custom domain, because this really aligns the model with what the user intent is. But in the future, we're seeing an evolution as well. We're still in the early days of vector search.
Starting point is 00:44:20 We've got these mostly single vector representations of data, but of course, this is pretty naive. And so moving beyond just a single representation into sort of multiple representations, and then having much more intelligent kind of query models as well, we'll be able to, I think, not only have the benefits of the keyword search, like you say: a part number comes in, the system knows this is a part number, and it should just return the exact matches. Versus a question: okay, now we can default to certain different behaviors.
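The routing behavior described here, exact match for part numbers and semantic search for everything else, can be sketched with a small classifier in front of the two paths. The part-number pattern and the toy catalog are hypothetical, and the semantic path is a crude substring stand-in for a real vector search:

```python
import re

# Hypothetical house format for part numbers, e.g. "WF-1001".
PART_NUMBER = re.compile(r"^[A-Z]{2,4}-\d{3,6}$")

catalog = {
    "WF-1001": "reverse osmosis membrane",
    "WF-2002": "sediment pre-filter",
}

def semantic_search(query):
    # Crude substring stand-in for a real vector search over descriptions.
    words = query.lower().split()
    return [pn for pn, desc in catalog.items() if any(w in desc for w in words)]

def route(query):
    q = query.strip().upper()
    if PART_NUMBER.match(q):
        # Exact-match path: don't infer, don't recommend.
        return [q] if q in catalog else []
    return semantic_search(query)

print(route("WF-1001"))              # exact part lookup
print(route("filter out sediment"))  # falls through to the semantic path
```

A smarter query model, as Jesse suggests, would learn this routing rather than hard-code it.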
Starting point is 00:44:51 you know, with vector search, and the system will know and understand exactly what, you know, what needs to be done. Yeah, I mean, that was kind of a trick question because I want a world where, like, the only search is, you know, is, like, well-tuned vector search because if you've experienced it done really well, it's so much better. It's almost just like, and I'm trying to think of how to describe
Starting point is 00:45:12 it. It's one of those experiences where you're just like, yes, this is what it actually, like, I guess maybe it decreases the mental load so much and it feels so intuitive that it just feels natural, I guess, and not anticlimactic, you know. But I don't know. I mean, what do you think? You've done so much search, but, like, that's the world that I... I think the thing that got me recently on this topic was, let's see, ChatGPT-4 has been out a little over a year. And just using, like, voice search for Siri or for Google, it feels so bad. Because if you type into ChatGPT or use the little voice thing, versus what somebody might normally, you know,
Starting point is 00:45:58 use like say, you know, okay, Google or Alexa or whatever. Like it's a markedly different experience and accuracy. So that was something that really struck me recently like, oh, like this was just you know, a year ago. Yeah. Yeah. And surprising that they haven't really improved, you know.
Starting point is 00:46:17 Yeah. Those two particular things. I guess that's coming out of that. Yeah. Well, I mean, what do you think, Jesse? Like, they've got to be feeling the heat, right? Yeah. I mean, it's pretty interesting. I think, you know,
Starting point is 00:46:29 a lot of, and that's why I'm saying search is quite interesting because, you know, there's obviously some incumbents there, but, you know, it's a wave of AI,
Starting point is 00:46:35 you know, they obviously have to, they've got an existing business model and whatnot. And so, you know, they've got to sort of, you know,
Starting point is 00:46:40 they can't really hard pivot, you know, at short notice. And so I think we're sort of seeing some inertia there, and then obviously having to work out what to, you know, take a big bet on in terms of where the future is going. But, I mean, I think there's also a lot, you know, that we don't see, which is, you know, they've got a particular business model, they're optimizing for particular things. And so that's also what you don't understand from a search
Starting point is 00:46:57 perspective is, you know, what's the incentive for the results that they're providing, right? Like, you know, someone who's got a, you know, web-scale search, right? They're going to be, you know, they live off ads. They're selling ads. And so the way they serve results is going to be dominated by these kind of business objectives. That's what the whole system is being optimized for. And so that's really one thing to consider as well, you know,
Starting point is 00:47:18 what the incentives are of the search provider. And that will dictate a lot of, you know, how these things are done. You know, do they deliberately, you know, sort of, yeah, almost, you know, have different results? Yeah. And I guess that's why I was, like, slightly optimistic about voice: it isn't nearly as monetized, right? So I was like, maybe they'll innovate there faster. But yeah. Yeah, that's such a great point. I mean, yeah, just the erosion of, like, quality search due to revenue, just like web search. That's great. Well, we're really close to the buzzer here, Jesse, but I want to ask you, we talked about the hype.
Starting point is 00:47:56 We talked about where there's over-bullishness. We got a great breakdown of vector search and, like, all the exciting things there. What other things in sort of the AI space are you personally excited about? I mean, you're building a company in this space, but what excites you as you look at the landscape? Yeah, I think it's really exciting to see, I think, the evolution of large language models, particularly into different modalities. So, you know, large language models that are able to, you know, just like you said, even hear, you know, see, and also obviously use natural language, but then using them in a way that they can act as an agent or a
Starting point is 00:48:34 controller or a critic. And so you can now actually put these, you know, language models inside systems, and they can make evaluations, they can route logic. And so actually, you know, this is something that's been quite powerful and really hard to achieve otherwise. You know, one of the great things about these models is that they've got, you know, this natural language interface. It's kind of lossy, and sometimes you do lose a lot of, you know, nuance with it, but it's also incredibly good
Starting point is 00:48:55 at, you know, gluing together all these different interfaces. So it's just sort of, you know, this layer, this interface layer, that allows you to connect audio, language, video into one thing, and then it can go into a database, for example. So being able to sort of unlock, you know, the language models with these different, I guess, sources of data, and then being able to take actions or outputs, is really exciting, you know, from my perspective. So you can think about it from a search perspective: you know, you can have a system that could be optimizing itself, right? Like, you can have a language model that knows what good search results would be, say, for your domain.
Starting point is 00:49:28 You know, it can literally send it off and it starts to just collect data, do search results, optimize the system. So, you know, moving in that direction, I think, is incredibly exciting. Yeah. Yeah. I feel like we're going to enter this phase where, when you talk about the history of, you know, going all the way back to, like, the open source stuff, Solr, Elastic, Algolia and all this sort of stuff. What's interesting is that was, like, a decade. You know, that was, like, an entire decade.
Starting point is 00:49:56 My sense is that the advance in search is going to do, like, a decade in a drastically short amount of time. And I think it's because of what you're saying, Jesse. Yeah, it's going to be very interesting. I mean, hopefully it does, you know, get better rapidly, you know, and everyone's using Marqo, you know, so that we can avoid this, you know, the perennial frustration of searching. Yeah, it's going to be, you know, fascinating to see how it goes. And I'm incredibly excited as well, just to add on to that last question, about the future of interfaces as well. We're seeing this evolve a lot. And, like, you know, vector search is really powerful, and, you know, just this kind of idea of punching in a couple of keywords and then
Starting point is 00:50:33 sort of pressing enter you know if we can just sort of move away from those ideas and actually think about how do we interface with these search systems you know the search results i think will be much better the experiences will be much much better and so i think that's very exciting to think how can we actually leverage these new experiences with this new technology as well? Love it. And where I should have asked this earlier, where can people go to see Marco
Starting point is 00:50:54 try it out? Where should they go to check it out? Yeah, head to marqo.ai. We've got live demos on the site, so you can check it out there. And GitHub: we've open-sourced it under an Apache 2 license, so you can spin it up on your laptop and get going with a really thorough start guide. You can build your first vector search experience in
Starting point is 00:51:14 you know literally a couple of lines of code you know image search and you know really experience those sort of aha moments i think it's you know like you said when you first sort of experience it it can be quite magical and i think we've had some customers before when they go to first stop the end-to-end system and they started searching emojis then all of a sudden you know to return your cat emoji and they're getting pictures of the cat they're like oh my god this is amazing you know this understands you know this stuff you know and so yeah head over to yeah marco.ai or head over to the you know github which you can get to from the site cool all right well jesse thank you so much for giving us your time i
Starting point is 00:51:44 learned a ton, and we'd love to have you back sometime in the future. Yeah, thank you very much. That'd be my pleasure. The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.
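Editor's note: to make the vector search idea discussed in the episode concrete, here is a minimal sketch of how such a system ranks results. This is not Marqo's actual API or implementation; the hand-made three-dimensional "embeddings" are hypothetical stand-ins for what a learned model (as in Marqo) would produce, but the ranking step, cosine similarity between a query vector and document vectors, is the core mechanic.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical toy embeddings; a real system embeds text/images with a model.
docs = {
    "photo of a cat":   [0.9, 0.1, 0.0],
    "photo of a dog":   [0.1, 0.9, 0.0],
    "terms of service": [0.0, 0.0, 1.0],
}

def search(query_vec, k=2):
    # Rank documents by similarity to the query vector, return the top k.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

# A query like a cat emoji would embed near the cat photo's vector,
# which is why emoji search "just works" in an embedding-based system.
print(search([0.95, 0.05, 0.0], k=1))  # ['photo of a cat']
```

Production systems replace the exhaustive `sorted` scan with an approximate nearest-neighbor index (e.g. HNSW) so the ranking stays fast over millions of vectors.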
