The AI Daily Brief: Artificial Intelligence News and Analysis - Multi-On is What You Wanted AutoGPT to Be - Interview with Founder Div Garg

Starting point is 00:00:00 On today's AI Breakdown, I'm speaking with Div Garg. Div is the founder of MultiOn, a new AI agent that uses the browser to execute complex tasks. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Like, subscribe and share and go to Breakdown.network for more information. Welcome back to the AI Breakdown. Right now, as I mentioned, I am traveling in Europe. And so as something a little bit special and different, I wanted to bring you a set of interesting interviews for that time when I'm going to be away. Today, my guest is Div Garg. Div has worked on AI at numerous companies including

Starting point is 00:00:38 Nvidia, Apple, Google, and Uber. He's an adjunct faculty member at Stanford, and he's now building MultiOn, which they're billing as the world's first AI personal agent. Now, if you've been listening to this show for the last couple months, you know that there has been immense interest in AI assistants and AI personal agents. As people have gotten a little bit perhaps disillusioned with things like AutoGPT, I've seen a number of people who have early access to MultiOn and feel like it was what they were looking for out of that project. Div and I talk a little bit about his background, about MultiOn's current capacities and how people are using it, and what he thinks the future of AI personal agents really is.

Starting point is 00:01:20 All right, Div, welcome to the AI Breakdown. How you doing, sir? Great, yeah. Thanks for having me here. Yeah, no, I'm super excited. As I was just saying to you off air, I think you are building in one of the areas that people are most excited about. So I think it's going to be a great conversation. But before we get into MultiOn and what you're doing now, what's your background? How long have you been in AI? What's the perspective that you're bringing to this? I've been doing AI for the last almost six years. And I think I would say I almost started around like my freshman year. So I actually did a lot of physics in high school, did like the International Physics Olympiad. And when I joined undergrad, I was like,

Starting point is 00:01:55 what is the right thing to do, like, should I do physics, should I do something else? And it seemed like physics was saturated and you needed like sort of like a different thing to make the jump to solve a lot of these like problems. And it seemed like AI was the right thing to like do. Like I had the feeling that there'll be AI that will solve physics, not just humans. So like a combination of both sort of made the choice of like sort of like focusing on AI like since very early on when I started college. And I've been working on like a lot of systems. My first internship was actually working on an autonomous driving car back at Uber, where I did a lot of benchmarking for them, like trained their like 3D computer vision models for driving cars on the roads and detecting all the vehicles and like the pedestrians, cyclists on the road and making that really safe as a system. And afterwards, did a lot of research

Starting point is 00:02:44 around how do you make autonomous driving like cheaper, safer? So we did like this research at Cornell where we got an autonomous driving car to just work with two cameras instead of a very expensive lidar sensor. And at that time like lidars used to be like $70,000, more expensive than the car itself. We showed that you can do a very similar thing just using like $100 cameras. And we got a lot of like media coverage on that, got a Forbes coverage and a bunch of other

Starting point is 00:03:07 publications, and it just got me really excited about the potential of AI and how you can apply it to the real world. So that has been my focus for like the last couple of years. Worked in like a bunch of like big companies. I was on like a lot of like top secret AI projects. So I was like almost like the go-to guy. Like, I had an internship at Google. They were like, oh, there's this like thing on computer vision we're working on. We can't disclose the details to you, but we like your resume. You want the job? Yes or no? I was like, okay, sure. I had a similar experience in like Apple. So I've worked on a lot of this, like, sort of like AI, like very like security stuff in the big companies. And there was a lot of

Starting point is 00:03:47 fun. That was also a time like where AI was like research, but like it doesn't actually work as a product. So people tried a lot of things. I actually worked on like some AI devices. Like it's all like air kind of stuff in like Google back in the day. Did a lot of like interesting reinforcement learning stuff in Apple, did some diffusion model stuff in NVIDIA. So that was fun. But there were no actual real-life applications. And after, I joined Stanford, where I was focusing a lot on physical agents like robots, how can you make them more controllable and like how can you have more powerful algorithms that can learn from human data? So I would say like a lot of how robotics currently works is you just have pre-programmed

Starting point is 00:04:23 loops. You just like program like a script and it just like goes and executes that. A lot of my research thesis during my PhD at Stanford was like, how can you learn from human data? Can you like observe videos of humans? Can you like sort of like see what humans are doing? And like teach a robot or agent to like sort of like do similar things. So I created this algorithm around like learning from human videos. We actually won like the number one prize in a Minecraft AI challenge two years ago for like the NeurIPS conference back in 2021. So we had this like agent that can like watch like 50 videos of human players building houses in Minecraft and like use that to like sort of like go itself and then build a house. And so that was very

Starting point is 00:04:58 fascinating. Also did like a couple of projects on like how can you steer a robot using natural language. So we actually did like the first project around like how can you combine actions together with language to sort of like teach a robot, to like control it using human voice. So you can tell a robot like, okay, like go open this drawer or like pick up a mug and it can go and do that. So we did a lot of like explorations around just like physical agents. And then like I was actually working at a robotics startup for a while, building a lot of their like algorithms and simulations. And afterwards, like, it just seemed like the right time, like, sort of like,

Starting point is 00:05:31 after ChatGPT came out, I was like, okay, like, LLMs as a technology are getting really good. And it seems like we are reaching this phase where we can start communicating with AI agents, because like before it was like sort of like you have everything as a number, it's like a tensor. You don't really know what's happening. But now it's like, okay, like we're getting past this like communication gap where like, as a human, I can tell it something and it can go and like understand that and communicate back to me what it's doing. And that just seemed like, okay, that was something that was missing.

Starting point is 00:06:00 And like, we're finally getting there where this can become actually usable. And everyone can go and steer this like sort of agents. And so, yeah, so like ChatGPT has been like fascinating. Like as a technology revolution, I think everything was there. But just like it made like LLMs more mainstream. Especially with GPT-4, now you have such good reasoning capabilities. Yeah, so I've been very interested in like computer interaction agents recently, which is like sort of like MultiOn, how can you actually take like a language command,

Starting point is 00:06:27 which could be a voice or text, and actually translate that to real-life actions by controlling a human browser. And this is similar to how a human controls a website. So if a human can go and sort of like click, type and do everything, my thesis here is like we can train an AI that can also do this very effectively at the same data as a human. And so a lot of like the current approaches you will see

Starting point is 00:06:47 around like plugins and APIs, which is like, it's good. But the problem there is like, they're very restrictive. It's hard to like build APIs for everything. And it's almost like using a backdoor where like someone has to give you entry and expose it so you can go and use that. But like using like the browser itself sort of like a front door, like it's like every human is already doing it. And so if you can teach an AI to like just use the front door properly, we can like interact with anything on the web, potentially also like anything on the desktop. And so just seems like a very horizontal

Starting point is 00:07:14 and powerful way we can like reach very powerful virtual agents. It's super interesting. It sounds like one common thread, you know, first when you were a student and then when you started to be, you know, in industry and building companies is a real interest in, you know, these tools moving from theoretical to actually usable, right, going and doing things. And to the extent that that's true, I think it's interesting that you found your way into this AI agent space, given, you know, we were talking about this as well, but there has been such excitement around AI agents. You know, AutoGPT ripped into people's view at the beginning of April and had this real sort of, you know, deflation, I think a couple weeks later, partially because, you know, it was people who were non-technical

Starting point is 00:07:54 using very fast, non-technical implementations of it. But, you know, if I had to sum it up, it was basically what people were excited about was the idea that, one, an AI could figure out the steps to achieve a goal, and then two, it could actually do the steps. And I think what a lot of people found with the first implementations is that that first part happened fine. It's a great plan for how you would go do X, Y, or Z goal, but then there was no actual connectivity to actually going and executing. And it sounds like MultiOn's approach to this in some ways is, like you said, walking in the front door of the browser and trying to make sure that it can interact with, you know, a thing that we all interact with, which is, you know, the web browser.

Starting point is 00:08:33 Definitely. Yeah, so it's actually interesting. Like, for us, like, MultiOn was actually in a very functional state back in February. And we just decided not to release it because one was, like, around trust and safety. So, like, if we just release this to, like, a million people, like, things are going to go haywire and, like, how do you control things? Another thing was, like, we tested with some folks and they were, like, just, like, scared, like, they were very skeptical. Oh, this thing can go and control my computer. That

Starting point is 00:08:58 doesn't seem right. And so the interesting thing was, like, after AutoGPT, people became more familiar with agents. Because before that, like, people didn't know what agents were. And so if we talked to someone who's non-technical, they'll be like, just like, oh, what is this thing? Why is it taking control of my computer? What's it doing? But now, after AutoGPT, people were like, okay, the agents exist and everything. So I'll say, like, it helped us, like, sort of, like, clear out the space where people know, okay, like, what agents are, what can they do. And if we like now like sort of like give MultiOn to someone, like they understand the

Starting point is 00:09:27 capabilities and we can make it into like a trustworthy experience. So it's been an interesting journey because we basically started a lot before AutoGPT but have been like sort of like just trying to focus a lot on like how can we make it more reliable, make it more trustable, put in like safety guardrails. And so that's been the focus that we have had for the last three months in terms of like making it better and better. And currently we are in this like closed beta where we have currently around 100 beta users. It's mostly like invite only. We are trying to increase that to like a thousand beta users over the next three weeks. And then we have like around 20,000 people on a wait list. So we'll be doing a lot of these launches

Starting point is 00:10:02 as we iteratively make it like much safer and like get all the feedback from like how are people finding it, how can we improve it as an agent. Amazing. I think maybe at this point it'd be great if you're up for it to do a quick demo so we can see, you know, how MultiOn works. Sure, definitely. So I can ask it something. Let's say something like if I say order a burger from The Melt in Palo Alto using DoorDash, for example. Okay, I might have to log in. So yeah, the login is one thing we take very seriously

Starting point is 00:10:33 because we want to make sure that people don't start misusing it for like building like bots and like spamming people and like doing like crazy things. So here the agent is like sort of thinking, so it's making a plan on what it should do. And then once it's created the plan, it starts searching. And here, like, you can, like, see what exactly it's doing.

Starting point is 00:10:55 And it will, like, start taking actions. So you can see, like, it said, like, it's clicking on, like, the first link and then, like, go and actually, like, start, like, ordering the thing. For people who are listening to this, because this will be on a podcast as well, there's basically a MultiOn window in the bottom right-hand corner of the screen that, as it sort of controls the browser, is explaining what's happening, right? So it says, I am clicking on the link to the DoorDash page for The Melt in Palo Alto to proceed. Right. And in this case, like, it can actually ask me a question. So it asks me, like, do I want the Melt Burger or do I want a different one? And so this is a new feature we've been experimenting with where we gave it the ability to sort of like ask different questions to a user, present options. So if I say something like, I want the BBQ Bacon Burger.

Starting point is 00:11:41 Yeah, and it's a bit on the slow side today. But usually it can work very real time. And then it's trying to find the burgers, and then it can, like, find it and add it to the cart. And then it can also, like, automatically do the whole checkout if I ask it to do that. So it went to the checkout screen. I'll probably pause it here, but otherwise it can go and, like, buy the whole thing for me. Yeah.

Starting point is 00:12:00 Yeah, so much we used to having a random door dash out to show up at my house. That's amazing. When it's doing this, how much is it asking you to approve at different steps versus just doing it itself? Yeah, that's a good question. So we have, like, two modes we built. So one is like a sort of like a step-by-step mode where whenever it does something, it will ask you like, should I do another thing for you? And if you press like a hot key, so in this case, like the right arrow on your keyboard, it will take another step.

Starting point is 00:12:25 And so you can like control like each step, you can approve like, okay, like next step, next step, next step. And so that way it's safe. If it does anything wrong, you can like stop it and give it another command. The second mode we have is like auto, where it will like go and like do the whole thing. And it's almost like watching a movie or like seeing a video on your screen where like it's interactively doing things. And we have built this sort of like a pause button. So if you press the space key whenever it's running, you can pause the agent. And so it's like almost having a remote control where you can say like, okay, like do me this thing.
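Editor's note: the two modes described here (step-by-step approval and auto with a pause button) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MultiOn's actual implementation: the `AgentRunner` class, its `mode` values, and the `approve` callback are hypothetical names, and the actions are plain strings standing in for real browser operations.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class AgentRunner:
    """Hypothetical runner illustrating step mode vs. auto mode."""

    mode: str = "step"   # "step": approve every action; "auto": run until paused
    paused: bool = False
    log: List[str] = field(default_factory=list)

    def pause(self) -> None:
        # Stands in for the pause hotkey (the space key in the interview).
        self.paused = True

    def run(self, actions: Iterable[str], approve: Callable[[str], bool]) -> List[str]:
        for action in actions:
            if self.paused:
                break  # the user hit pause: stop before the next action
            if self.mode == "step" and not approve(action):
                break  # the user declined this step: stop and await a new command
            self.log.append(action)  # "execute" the action (recorded only)
        return self.log
```

In step mode the `approve` callback plays the role of the right-arrow hotkey: returning `False` halts the run before the action is taken. In auto mode the callback is never consulted, and only `pause()` stops the loop.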

Starting point is 00:12:53 And then you press the play button and it starts doing it. And then you can like press the pause button anytime, or you can like give it a new command and then hit like play again. So it's a very interesting sort of way to control a computer where you can imagine like in the future you might just need like something like an Apple, if you've seen like these Apple TV remotes, which are like very minimal, it just has like a mic button and like a play/pause. That's all you need. You might not even need like a keyboard and mouse in the future. Yeah, it's super interesting. I can imagine, and I have no idea if your tests validate this at all, but I could totally see at the beginning people almost using the step version as like personal training wheels or trust training wheels because it's like,

Starting point is 00:13:37 I want to see how this thing works a few times. Because I bet a lot of people, you know, it's not like they've already adapted to a new interaction mode, right? They're experimenting with it. And so they'll naturally kind of step through. But I can imagine if you've done something two or three times successfully, then you just auto-pilot it, you know, especially once you know that there's sort of a pause button that, you know, if it goes haywire, you can stop it. Definitely. That's what we've seen. We also have some users that are like, we just got bored and we started playing with MultiOn because it was like fun to see what it's doing. So we've seen like a lot of those sort of things. We have also seen people like to widely share videos of it with their friends and they're like,

Starting point is 00:14:15 oh, like this computer is going and like automating itself. It's almost like a ghost in the shell sort of experience for folks like using it for the first time. Do you feel like you guys have a sense? Obviously, you're very early with the test. What are people using it for so far? And how does that compare to what you thought they might use it for? We have seen like a lot of people are actually using it for research currently, for fetching information online. I've also seen people using it for like social media stuff where if you ask MultiOn like go wish happy birthday to my friends on Facebook, it can actually find everyone who has a birthday and like send them a happy birthday message, for example, or it can like find people on LinkedIn and send them a message. So we've seen people

Starting point is 00:14:53 like sort of do those sort of workflows, also on emailing. Another we have seen is like on ordering. So we've seen people like buying like stationery on Amazon or like toilet paper or something, for example, ordering chairs. So those are some interesting things we've seen so far. Also, let me also do this as a demo. So this is a really interesting one. So we also made MultiOn a bit streamlined for like scheduling. So it's actually very good at like creating like calendar invites. And it can actually automatically include your Zoom link information.

Starting point is 00:15:25 So yeah, if I want to like send like a meeting invite, it can automatically go add my meeting details and my personal like Zoom link. And like sort of like save me like maybe like 15 to 20 interactions, especially if you're a power user who has to like book a lot of meetings or set something up. So let me do this as a demo. So if I say something like book a two to three p.m. meeting tomorrow, and in this case that's

Starting point is 00:15:53 me and my co-founder, and say team sync. So here the agent is thinking and it will create a plan. So it went to like the Google Calendar and then it can like start filling in all the information. So it can automatically start putting in everything. So it, like, added, like, the emails. It's also adding my Zoom link here and sharing it out. And then, like, send the whole thing.

Starting point is 00:16:16 And then we're also trying to build, like, a user verification workflow where before we send something, it can, like, send you, like, a notification, like, oh, I created this meeting invite. Does it look correct? Do you want to change something, for example? And if you approve it, then it can, like, go and send the actual thing. So if I say, like, send, it will send this meeting invite. Super cool.

Starting point is 00:16:34 Some questions on this. So there are two things that were correct that could easily be not correct that I noticed. One is it went to Google. Is it optimized for assuming that people are going to Google Calendar? And if not, how do you change that? And then the second is it knew your Zoom link in advance. So, you know, how do you as a user sort of interact to teach it that?

Starting point is 00:16:56 So we have built this like sort of like a memory scratch pad feature where like you can give it like your personal details. So if you like say like, okay, this is my name. This is my address, stuff like that. And then like MultiOn will actually know all of that and customize to it. So in this case, I've told it what my Zoom link is. I've also given it some instructions like, okay, like include my Zoom link in calendar invites. So I can give it any notes.

Starting point is 00:17:15 Almost like talking to like maybe like an assistant where like, okay, like these are some dos and don'ts. If I tell it my allergies, for example, or I can tell it like, okay, like what seats I like on a flight, like window versus aisle. And it can take those preferences into account and like take actions accordingly. So this is like a feature we're starting to build. And so that's how it knows my Zoom link. In terms of defaults, I think currently we are hard-coding defaults where like suppose like it wants to make a calendar invite. We tell MultiOn by default you should choose Google Calendar over something else. Or like if you want to search for something, choose like Google over like Bing, for example.
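Editor's note: the memory scratch pad and the hard-coded defaults described above can be pictured as a small key-value store plus a lookup table. The sketch below is an assumption-laden illustration, not MultiOn's design: the `Scratchpad` class, the `DEFAULT_PROVIDERS` table, the `provider_overrides` field (standing in for the user customization that is only mentioned as a future possibility), and the example Zoom URL are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical hard-coded defaults, mirroring the two examples in the interview.
DEFAULT_PROVIDERS: Dict[str, str] = {
    "calendar": "Google Calendar",
    "search": "Google",
}


class Scratchpad:
    """Hypothetical store for personal details and standing instructions."""

    def __init__(self) -> None:
        self.notes: Dict[str, str] = {}
        # Assumed future feature: per-user overrides of the defaults.
        self.provider_overrides: Dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        self.notes[key] = value

    def provider_for(self, task: str) -> Optional[str]:
        # A user override wins; otherwise fall back to the hard-coded default.
        return self.provider_overrides.get(task, DEFAULT_PROVIDERS.get(task))

    def build_context(self, command: str) -> str:
        # Prepend the stored notes so the agent sees them alongside the command.
        lines = [f"- {key}: {value}" for key, value in sorted(self.notes.items())]
        return "\n".join(["User notes:"] + lines + [f"Command: {command}"])
```

A caller might run `pad.remember("zoom_link", "https://example.com/my-zoom")` once, then pass `pad.build_context("book a meeting tomorrow")` to the agent, so the note travels with every command without the user repeating it.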

Starting point is 00:17:49 And so these are like some choices we have made. In the future, we can allow people to customize this. There's also like interesting partnership potential where like if someone comes to us and they're like, oh, can you make us your default provider for like, say, like travel? Like for like say Uber, for example, or say, can you make us your default provider for food, for example, which could be like DoorDash, and we could like use that for monetization. Totally. No, can you make the AI Breakdown your default source if you're asking a question about breaking news? No, I think it makes tons of sense.

Starting point is 00:18:14 So this kind of gets to a question that I was going to ask both broadly, but also in the context of your product specifically, which is, you know, what is your team's thesis about the future of this sort of AI personal assistant? Do you think that they're going to be super general with people using them for everything, just the same way they would use a personal assistant now? Do you think that they're going to get refined into, you know, you're going to have eight or ten of these that are sort of, you know, optimized for specific experiences?

Starting point is 00:18:44 Is there some combination or is it just too early to tell? I think, like, it's going to be a combination. What we've seen is, like, people don't like interacting with 10 different services. So if I had a choice, I would rather just, like, interact with one agent or AI that could, like, do, like, 100 small things for me in a day, rather than having an AI that can only do one specific thing. So people really like that modality more, especially around assistants, because they want something that can reduce the friction in a lot of their everyday small things.

Starting point is 00:19:13 So we have seen like there's a lot of space where there's potentially like one general agent that can like help a lot. It doesn't have to be really specialized, as long as it's like really helpful on like a lot of like really small things. So I think that's where like people really want like an AI assistant. And there's also space where you can have very specialized AI agents. An example could be like maybe you want to build like a travel agent. I'm taking a one-week trip to Italy.

Starting point is 00:19:38 For a human, that might actually take like one week to plan the whole trip, discuss with the friends, coordinate on the hotels, call the hotels, schedule like Ubers, flights, everything. So that is easily one to two weeks of planning. And you can imagine if there was a specialized agent that can just go and do this in like one hour or something. I think there's a space for those sort of specialized agents. Similar could be true for a lot of like research where like,

Starting point is 00:20:01 for a lot of complicated, like, say, finance research or legal research, you have to spend a lot of, like, months, like, doing all the groundwork, finding everything. And so I think, like, for a lot of this, like, complicated, specialized jobs, you can have specialized agents. But for, like, I will say, like, for everyday life, I think you need, like, some sort of, like, a general agent, instead of interacting with, like, 10 different agents, you want to interact with just one. It's fascinating.

Starting point is 00:20:23 I mean, I think that the other sort of thing, which might add some heft to that theory, is you have to think that almost every company is going to experiment with retrofitting how you interact with their service with this sort of interface. And it will almost mean that you don't need to go out and seek specialized agents because they're just going to live in the apps where you already are. So if you use Instacart, you know, I mean, Sam Altman has said this about ChatGPT plugins. It's why he thinks that they don't yet have product market fit. He said something to the effect of, I think a lot of these companies that think they want to be in ChatGPT actually want ChatGPT in them,

Starting point is 00:20:59 you know, which I think is an interesting insight. But that does leave this space for this sort of day in, day out interactions. I think honing in on things that can be done in a browser is a really interesting insight as a way to sort of limit what the focus is while still keeping it really broad. Definitely. Definitely. We also see that in the future this could become an interesting layer or a platform where we could allow people to build like more like applications on top of MultiOn and expose this as sort of like an action layer, where if you want to go build like very powerful agents for some particular use case, you can have like MultiOn control the browser and do the heavy lifting and sort of like build like

Starting point is 00:21:39 experiences around that. So I think obviously there's a lot of space like that which a horizontal like agent like ours could enable. Yeah, super, super interesting. So what is next for you guys? You know, you're still in a very early beta. You said you're expanding that beta probably over the next three weeks. But, you know, what else is coming up on the horizon? Sure, sure. There are very exciting things. We're actually closing a big funding round. So we're starting hiring. So that's been great. We are actually organizing a hackathon this weekend. So we are organizing this like agent hackathon at the AGI House in Hillsborough. And it's almost like everyone in the AI space is there. So we have Karpathy coming to give the intro talk. And so it's

Starting point is 00:22:19 going to be very exciting. And we'll be giving everyone who's attending the event access to MultiOn as well as like programmatic control. So we'll be giving like API access. So currently if you see like MultiOn, a user can go and like give it commands, but we will allow you to like programmatically give it commands. And then you can connect it with LangChain, you can connect it with like other things.

Starting point is 00:22:39 And then you can like build very powerful applications and use cases for the duration of the hackathon. And so we're trying to see that as an experiment on like what people will do with the sort of like agents that can actually have a lot of purchasing power, for example, can actually like do. a lot of interesting things on the internet. And so we are also like trying to make sure like the event is safe and like we can moderate it.
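Editor's note: one simple way to moderate programmatically issued commands is a domain blocklist checked before the agent is allowed to act on a URL, for example to keep it away from banking sites. The sketch below is an assumption, not MultiOn's API: `is_allowed`, `submit_command`, and the placeholder domain `examplebank.com` are hypothetical, and the only library call used is the standard library's `urllib.parse.urlparse`.

```python
from typing import AbstractSet
from urllib.parse import urlparse

# Placeholder blocklist; a real deployment would maintain its own list.
BLOCKED_DOMAINS = frozenset({"examplebank.com"})


def is_allowed(url: str, blocked: AbstractSet[str] = BLOCKED_DOMAINS) -> bool:
    """Return True only if the URL has a host that is not on the blocklist."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # no parseable host (e.g. missing scheme): reject to be safe
    # Block the domain itself and any of its subdomains.
    return not any(host == d or host.endswith("." + d) for d in blocked)


def submit_command(command: str, target_url: str) -> str:
    """Hypothetical entry point for a programmatic command."""
    if not is_allowed(target_url):
        raise PermissionError(f"blocked target: {target_url}")
    return f"queued: {command} @ {target_url}"
```

Matching on the exact host or a dot-prefixed suffix means `login.examplebank.com` is blocked while an unrelated host such as `notexamplebank.com` is not, which a plain substring check would get wrong.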

Starting point is 00:23:01 So like people don't start doing like malicious things. So we have already banned like banking websites, for example, so you can't work on them. And it'll be like a fun experiment. But I think it'll also be like very different from any other hackathon. Yeah. No, that's fascinating. I just today did a part of the show about this idea for a new Turing test that came from Mustafa Suleyman from DeepMind and Inflection, where he said,

Starting point is 00:23:25 the new Turing test should be give an agent or an AI $100,000 and see if it can turn it into a million. So maybe someone will get a head start on that this weekend with the MultiOn hackathon. I will say it's almost like a superhuman test because if an agent can do that. I know. I know, exactly.

Starting point is 00:23:41 Well, listen, Div, I really appreciate you taking some time today. I'm very excited to see what comes of the hackathon. We'll definitely share what comes out of that on the show. Yeah, that was great. Thanks a lot for inviting me.

