Latent Space: The AI Engineer Podcast - SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

Episode Date: December 18, 2025

As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)

From SAM 1’s 11-million-image data engine to SAM 2’s memory-based video tracking, MSL’s Segment Anything project has redefined what’s possible in computer vision. Now SAM 3 takes the next leap, concept segmentation: prompting with natural language like “yellow school bus” or “tablecloth” to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio: SAM can now even segment audio output!

We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned on Llama, the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k, how SAM 3 separates recognition from localization with a presence token, why decoupling the detector and tracker was critical to preserve object identity in video, how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini, and the real-world impact: 106 million smart polygons created on Roboflow saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.

We discuss:

* What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like “purple umbrella” or “watering can”
* How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly
* Real-time performance: 30ms per image (100 detected objects on H200), 10 objects on 2×H200 video, 28 on 4×, 64 on 8×, with parallel inference and “fast mode” tracking
* The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity
* The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks), fine-tuned on Llama 3.2
* Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses, automating the hardest part of segmentation at scale
* Architecture innovations: presence token to separate recognition (“is it in the image?”) from localization (“where is it?”), decoupled detector and tracker to preserve identity-agnostic detection vs. identity-preserving tracking
* Building on Meta’s ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2’s memory-based tracking backbone
* SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like “find the bigger character” or “what distinguishes male from female in this image”
* Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples
* Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more

MSL FAIR team

* Nikhila: https://www.linkedin.com/in/nikhilaravi/
* Pengchuan: https://pzzhang.github.io/pzzhang/

Joseph Nelson

* X: https://x.com/josephofiowa
* LinkedIn: https://www.linkedin.com/in/josephofiowa/

Timestamps

00:00:00 Introduction and the SAM Series Legacy
00:00:53 SAM 3 Launch: Three Models in One Release
00:05:30 Live Demo: Concept Prompting and Visual Exemplars
00:10:54 From Prototype to Production: The Evolution of Text Prompting
00:15:45 The Data Engine: Automating Exhaustive Annotation
00:14:10 Real-World Impact: 130 Years of Humanity Saved
00:25:11 Architecture Deep Dive: Decoupled Detection and Tracking
00:28:02 SAM 3 Agent: Bridging Vision and Language Models
00:33:20 Head-to-Head: SAM 3 vs Gemini and Florence
00:47:50 Video Understanding and the Masklet Detection Score
00:20:24 Fine-Tuning and Domain Adaptation: From Waymos to Medical Imaging
00:52:25 The Future of Perception: Native Vision vs Tool Calls
01:05:45 Building with SAM 3: Roboflow's Rapid Auto-Labeling
00:57:02 Open Source Philosophy and the Path to AGI
00:58:24 What's Next: SAM 4, Video Scale, and Beyond Human Performance

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:03 Okay, we're here in the remote studio with the grand return of the Roboflow and Latent Space and SAM combo. Welcome to Joseph, my sort of vision co-host, I guess. Thanks. Great to be here. Welcome back. We also welcome back Nikhila Ravi, who's the lead on SAM 2. I guess just SAM in general, right? And we have joining us Pengchuan, who's also a researcher on SAM.
Starting point is 00:00:27 Yeah, nice to meet you guys. So congrats on SAM 3's launch. I mean, like the demo, each time you set it up, like, really amazingly. And I think, like, my general impression or takeaway when I tell people about SAM is, like, every time you have a new release, it's like, once a year you show up, you drop a banger, and then you, you know, you just drop the mic and go for next year.
Starting point is 00:00:50 And you also add a dimension. So I was entirely, like, weirdly not surprised when SAM 3 had the 3D thing. Because I'm like, well, yeah, which is the next dimension to go? It's like 3D. Actually, maybe just on that, I think that's actually a common misconception. We launched three separate models this time. It was SAM 3, SAM 3D Objects, and SAM 3D Body. Yes.
Starting point is 00:01:16 Those were two completely separate models, and SAM 3 is just the image and video understanding model. Which is on a DETR backbone and is sped up. Yeah, sorry, I didn't mean to sort of preface all this. But maybe just to remind our audience, or maybe for people new to the SAM series of podcasts that we've done so far, maybe each of you can sort of go around and intro, like, your sort of entry into computer vision or sort of your relationship with SAM. Go ahead, Nikki. Okay, cool. Hi, everyone.
Starting point is 00:01:46 I'm Nikhila. I'm a researcher at Meta. I've been at Meta for eight and a half years. So I've really been through the evolution of the field in that time. I started working on a range of different problems in computer vision, worked briefly on 3D. We built this library called PyTorch3D. But I really started on Segment Anything as a project in around sort of late 2021. So it's actually, you know, been almost four years that I've been, like, working in this Segment Anything space.
Starting point is 00:02:20 And, you know, we started with SAM 1 in 2023, SAM 2 last year in July 2024, and then now SAM 3. So it's been a culmination of a lot of work by a lot of people over the years. So yeah, really, really excited to be at this point and, you know, get to share it with all of you. I'll hand it over to Pengchuan. Yeah. Hello, everyone. So I'm Pengchuan.
Starting point is 00:02:45 I'm a researcher on the SAM team. I have been working in the computer vision field for nearly nine years, starting from 2017. I think it's a long time. I worked at MSR for five years and then moved to Meta Reality Labs to work on egocentric foundation models on AI glasses for a while. And then in 2020, I moved to the SAM team, and that time is exactly the start time of SAM 3. And, really, I think that's a lifetime experience I have had on the SAM 3 team. And I'm glad that SAM 3 is out and I kind of achieved my original grand goal of
Starting point is 00:03:26 computer vision: to reach kind of human performance on detection, segmentation, and tracking in images and videos. I'm Joseph, co-founder and CEO at Roboflow, where our mission is to make the world programmable. We think software should have the sense of sight, and models like SAM and others are critical to unlocking that capability. Now, millions of developers, half the Fortune 100, build with Roboflow's tools and infrastructure to create and deploy models to production. We've been big believers in the Meta family of open source models, all the way
Starting point is 00:03:56 back to like Mask R-CNN and Detectron2, all the way to the present with SAM 1, SAM 2, and SAM 3. The work that the Meta team does to advance state-of-the-art and open-source computer vision has been bedrock to enabling developers and enterprises globally to adopt AI. So we've been big fans of the work, and I'm pleased to be joining you today, swyx, to co-host the episode on SAM 3. And you guys shipped your own DETR model, too. Yeah, we've been doing some work to advance machine learning research too.
Starting point is 00:04:28 Like one of the, for example, DETRs, detection transformers, which was born out of NeurIPS last year; I think, swyx, you actually challenged us. You were like, hey, what are some of the advancements that are happening in computer vision and in visual AI? And we had this observation that transformers had surpassed a lot of CNNs in vision
Starting point is 00:04:44 tasks, but they hadn't been made to run real time, as in over 30 frames per second, for example, on like a small T4, or excuse me, on a small, like, edge device, and hundreds of frames per second on like a T4. We did some research and published RF-DETR, Roboflow Detection Transformer,
Starting point is 00:05:01 which is, you know, we kind of joke, the greatest-of-all-time model for doing real-time segmentation and object detection on the edge. Now, in RF-DETR, you know, you have to have a fixed class list and need to know some of the objects that you want to segment ahead of time.
Starting point is 00:05:15 But for anyone that's running on, like, constrained compute and on an edge device and wants like an Apache 2 model to do that, RF-DETR and its family of models are key to fulfilling that mission and that goal. Yeah, amazing. Okay, I think we are going to just go into a SAM 3 demo. I think Nikki, you've prepped some stuff to show us. And this is great because obviously there's nothing better than the creator of the tool showing off the tool. So just to start with, like, what is SAM 3? So SAM 3 is a model that can detect, segment, and track objects in images and videos using what we call concept prompts.
Starting point is 00:05:52 So I'm going to start with a simple image example, then we'll show you a video example. So a concept can be anything that is a short text phrase. So here, for example, we can use something like watering can, and you can see the model predicts a mask for the watering can. You can also then refine the prompts using clicks or additional visual exemplars, which I'll show you in a different image. But essentially, the idea of a concept prompt opens up the ability to find all instances
Starting point is 00:06:26 of an object category without having to manually click on every single instance, as you would have had to do if you were using SAM 2 or SAM 1. Now, if the model misses any of the instances, you can add visual exemplars. So a visual exemplar is also a way to describe a concept to the model. So here I can add a positive box here and show the model that this is also an instance of a flower that we want to detect. So this is just on images, but what's really cool is you can now also do this in video. And so here I'll show you an example.
Starting point is 00:07:08 Maybe this is a football match. You want to track all the players in white, for example. So red jersey or white jersey, you can provide a concept prompt. And the model will find the objects in the first frame and then track them and detect the new instances that appear later on in the video. So it's not just detecting on the first frame, but both tracking those detections and finding new instances that appear throughout the video.
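To make the idea of a concept prompt concrete, here is a minimal sketch of how one might represent it in code: a short text phrase plus optional box exemplars. The `ConceptPrompt` class, its field names, and the negative-exemplar list are illustrative assumptions, not SAM 3's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


@dataclass
class ConceptPrompt:
    """Hypothetical container for a concept prompt; not SAM 3's real API."""

    phrase: str  # short noun phrase, e.g. "watering can"
    positive_boxes: List[Box] = field(default_factory=list)  # "this is an instance"
    negative_boxes: List[Box] = field(default_factory=list)  # assumed: "this is not one"

    def add_exemplar(self, box: Box, positive: bool = True) -> None:
        """Attach a box exemplar after checking that it has positive area."""
        x0, y0, x1, y1 = box
        if x1 <= x0 or y1 <= y0:
            raise ValueError("box must have positive width and height")
        target = self.positive_boxes if positive else self.negative_boxes
        target.append(box)


# Start from text only, then point out a flower the model missed.
prompt = ConceptPrompt("flower")
prompt.add_exemplar((40.0, 60.0, 120.0, 150.0))
prompt.add_exemplar((300.0, 20.0, 380.0, 90.0), positive=False)
print(prompt)
```

The point of the structure is that the text phrase and the exemplars describe the same concept, so a model can consume them together rather than as separate clicks per instance.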
Starting point is 00:07:39 And one of the things we love to do in our demos is also show some real-world applications of this, and so one idea here is that you can use this for video editing or adding effects. So here it was a really simple mask effect, but you can imagine, for example, you might want to add a trail around the players. You know, you can follow them around. Maybe you want to clone them.
Starting point is 00:08:05 So you've got multiple players running around. You can also do background effects. For example, spotlighting players. And so these are just fun things you can do on top of the SAM 3 outputs. And this is just like a way to show people like what you can do. There's also some templates which basically are pre-populated with the text prompt and an effect. And these are just some fun ways you can use the outputs,
Starting point is 00:08:30 but really, you know, the crux of it is in this, like, create from scratch where you can upload any image or video and try SAM 3 on that. And we'll share the link so you can try it out as well. One of the other demos that I have is like a busy scene for like doing labeling, which we can do later on, but just to give you a preview. It's like if you want to find tablecloth and maybe like back there, there's like airplanes, so I'll do airplane,
Starting point is 00:08:57 and you kind of get the ability to start to... Do you find the confidence thresholds? They do. I don't know why tablecloth wasn't as good. I've used that one in the past. Table, maybe? Yeah, cool. Wow, look at that.
Starting point is 00:09:12 Yeah. I think the other impressive thing that you guys emphasize in your launch is also like the latency. I don't know where this particular inference is running, but it says something like, SAM 3 runs in 30 milliseconds on a single image for 100 detected objects on an H200. Obviously this isn't an H200,
Starting point is 00:09:31 but it's also like just impressively fast. And sometimes basically you can be real time if you want. Yeah, definitely on images. On images it's really fast. And then on video it kind of scales with the number of objects. But for a limited number of objects, it's still real time. Yeah.
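As a rough check on the "real time" claim, the sketch below converts the quoted 30 ms image latency into frames per second and tabulates the video figures from the episode notes (10 objects on 2 H200s, 28 on 4, 64 on 8). The function and variable names are illustrative, and the objects-per-GPU ratio is simple arithmetic on those quoted numbers, not a measured benchmark.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


# 30 ms per image, as quoted for 100 detected objects on one H200.
image_fps = frames_per_second(30.0)
print(f"image throughput: {image_fps:.1f} fps")

# (number of H200 GPUs, objects tracked in real time), from the episode notes.
video_configs = [(2, 10), (4, 28), (8, 64)]
objects_per_gpu = [objects / gpus for gpus, objects in video_configs]

for (gpus, objects), ratio in zip(video_configs, objects_per_gpu):
    print(f"{gpus} GPUs: {objects} objects ({ratio:.1f} per GPU)")
```

At 30 ms per image the model clears the 30 frames per second bar mentioned earlier in the conversation, and the quoted video numbers suggest the real-time object budget grows at least proportionally with the GPU count.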
Starting point is 00:09:49 I'd also add, even for video, if you can afford the GPUs, the inference is pretty parallel. So even if you have a lot of objects to track, you can still get real-time tracking performance as long as you scale up the GPUs there. So I'm reading in the paper, it's 10 objects on 2 H200s, 28 on 4 H200s, and 64 on 8 H200s,
Starting point is 00:10:13 something like that. I don't think there's an architecture. I don't know if this is the parallelism demonstration that we're talking about. Yeah, in fact, when you try the demo, the video uses the kind of parallel implementation of the video inference. So it's already in that fast mode. Yeah, you try it with a video with like lots of objects,
Starting point is 00:10:36 and then you can notice that it's actually not very slow, and you get the sense that we are doing the multi-GPU inference. Yeah, everyone should try it out and see for themselves. So, okay, amazing. So this thing about concept segmentation, I feel like you had a prototypical version of this. And in your paper, you really talk about like sort of generalizing it. I guess like, what was the planning like in SAM 3? At the start of this, is what we have today exactly what you planned for?
Starting point is 00:11:08 Or did you kind of, did it emerge as you discovered capabilities? Maybe I could quickly talk about, yeah. In SAM 1, we did have a proof of concept of text prompting. But that was just a very early exploration. It wasn't really built out and, you know, it became the most highly requested feature since then. And so we, you know, in SAM 3, we really wanted to do it properly
Starting point is 00:11:30 and actually do this in a way that it works in all different scenarios. And so we had to really think about how to formulate the problem. So it could have been that we took open-ended text input and it works for all open-ended text, or we could have been more focused, which is what we chose to do, and really focus on these atomic visual concepts like yellow school bus or a purple umbrella
Starting point is 00:11:56 and really focus on nailing the problem for these like atomic visual concepts. But Pengchuan, maybe you want to talk a little bit about kind of the benchmarks that existed previously and how we had to actually fully redefine the task and the benchmark that we wanted to solve. Yeah, and maybe just to add to Pengchuan's point, like if you look at the size of these benchmarks. The previous benchmark Pengchuan mentioned, LVIS, that everyone uses, has about 1.2K unique concepts, and the benchmark that we created, which we're calling Segment Anything with Concepts, or SACO for short, has more than 200,000 unique concepts. If you think about the natural language that people use, we don't just use a thousand words. We have a very
Starting point is 00:12:43 large vocabulary and we really wanted to build a benchmark that can capture that diversity and size. Yeah, it's really impressive and also like very formulaic, I guess, or classic, that every great model's work starts with a lot of data work. I think basically this is, you know, the scaled-up version of the same process from SAM 2. Yeah, in some ways, I think in SAM 3 the data engine really was like a very novel and critical component. I think, you know, to your point, a decisive advantage in AI is not just about the models, but really about the data,
Starting point is 00:13:21 and maybe even more so is actually the data engine to generate that data. And we put a lot of effort in SAM 3 specifically to try and automate that process a lot. One of the things that we're really impressed by is the diversity and depth, as well as breadth of uses that we see with models like SAM in production. Basically, when you think about computer vision,
Starting point is 00:13:44 you know, folks kind of like always classically, they think about like dogs and cats and simple sorts of things. And the reality is, like, computer vision is where AI kind of meets the real world. So any sort of thing that needs to be seen and understood, you need to have understanding of that thing. So a model like SAM, expanding the concepts from like, you know, a few thousand closed-form concepts max in a single model
Starting point is 00:14:04 to tens of thousands of concepts means that you're going to see such a huge acceleration of the number of fields and applications of the model. So this is SAM 3, right? So we've already seen and measured some of the impact of the SAM family of models. And we pulled some of the updated stats on how impactful SAM is being across the Roboflow community. I think Roboflow might maintain one of, if not the largest, hosted instances of SAM. And we've seen basically 106 million kind of smart-poly-created examples that are SAM 1, 2, or 3 powered. And we estimate that that saved humanity collectively, like 100, maybe 130 years,
Starting point is 00:14:44 depending on exactly how you want to do the calculation of time, just curating data. And each of those use cases, right, isn't dogs and cats on the internet. It's things like, I don't know, we see medical labs across the world that are accelerating cancer research by doing things like counting and identifying the automation of neutrophils after a given experiment. Or we see folks that are using aerial imagery for things like helping a drone navigate through the world, or maybe counting and seeing solar panels from above, or maybe even doing like insurance estimates. We see folks that are building underwater trash cleanup robots. So like you can imagine an autonomous underwater bot that's navigating through the Pacific
Starting point is 00:15:21 Ocean and identifying and grabbing plastics and cleaning up the world's ecosystem. Relatedly, we've seen some work with aquariums across the U.S., like MBARI, who are doing work for keeping track of species and identifying the impact of given steps that are taken, or increasing the populations of given fish, with like underwater fish cameras. We see folks in industrial settings like doing work to produce electric vehicles or get products from point A to point B. At the time I'm recording this, it's like near Christmas time and it's like high time for holidays for folks that are doing gift giving. And that ends up being really, really high time for making sure goods and services show up where they're supposed to be at the given point in time. One of the statistics
Starting point is 00:16:03 that we track is the frequency with which folks cite works like SAM or Roboflow or blogs that we publish. And there's now basically like a little over two research papers published every day, citing some of the work across like the Roboflow community. And that's folks that are like publishing in Nature and ScienceDirect and a number of fairly prestigious journals. And each of those, you got to think about it. Each one of those publications is someone's like seminal work, often six, 12, 24 months of effort that's been accelerated from models like SAM. So it's not an exaggeration to say, like, models like SAM are speeding up the rate at which we, you know, solve global hunger or find cures to cancer or make sure
Starting point is 00:16:46 critical medical products make their way to people all across the planet. And at the infrastructure level, we're, like, thrilled and surprised constantly by the breadth and depth of adoption that we see from the community. I mean, in the first five days of SAM 3, there were like eight million inferences from folks running it across all diverse sets of fields. And that's actually only increased because it was released and then there's like Thanksgiving and now it's back and folks are like hitting it pretty hard. So it's been incredibly encouraging to see both the depth of adoption and how much the community takes and uses and relies on models like SAM in prod. Yeah. And I think, maybe just to add to that from, like, the Meta side, like we don't usually get
Starting point is 00:17:22 as much visibility into all of these real-world use cases. So, you know, being able to kind of hear that from Roboflow and having these models available on the platform is like so valuable for us. It's also, you know, we get to know how these models actually work in the real world, which is, you know, ultimately the best eval for a model. So I think, you know, it's definitely awesome to hear about all these things that we're empowering. Nikhila, you had this comment of like the best eval for a model is like, it's not necessarily a benchmark. What was it? It's like if it works on real world things. I think it's a really good soundbite. Something like the best eval is if it works in the real world.
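The usage figures Joseph quoted a little earlier can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below works out what "100, maybe 130 years" across 106 million smart polygons implies per polygon, and what 8 million inferences in five days means as an average rate. The variable names are illustrative, and the per-polygon interpretation is an assumption about how such an estimate might be computed, not Roboflow's stated method.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 31.56 million seconds

polygons = 106_000_000  # smart polygons created, as quoted

# Implied labeling time saved per polygon for the low and high estimates.
saved_per_polygon = {
    years: years * SECONDS_PER_YEAR / polygons for years in (100, 130)
}
for years, seconds in saved_per_polygon.items():
    print(f"{years} years saved -> {seconds:.1f} s per polygon")

# 8 million inferences in the first five days of SAM 3, as quoted.
inferences, days = 8_000_000, 5
per_day = inferences / days
per_second = per_day / 86_400
print(f"{per_day:,.0f} inferences/day, about {per_second:.1f} per second")
```

Under these assumptions the estimate corresponds to roughly 30 to 39 seconds saved per polygon, which is in the same range as the per-image annotation times cited for the data engine in the episode notes.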
Starting point is 00:18:01 Yeah, true. And that's like the ultimate goal for all of our models, like SAM 1, SAM 2, SAM 3, we want people to use it out of the box as much as possible. And I think, you know, with language in SAM 3 specifically, there does need to be, in some cases, some domain adaptation. But we have sort of tried to make that easy. I don't know, Pengchuan, you want to talk a little bit about that, like the fine-tuning aspect?
Starting point is 00:18:28 I wanted to also endorse like the real-world thing. I was just so happily surprised when I was visiting the CZI Imaging Institute in preparation for our pod with Mark that they were using SAM in imaging the human cell.
Starting point is 00:18:45 And they showed us like how in reality all these sort of masses are actually like really undifferentiated and it's really hard for the human eye to track. This is actually a simpler one where you can actually there's not, This is like pretty clean here. In reality, a lot of it is just like, just gray mush.
Starting point is 00:19:00 And you have to like segment individual lysosomes out of these. And they showed us how they were using SAM and fine-tuning SAM to do it. Yeah, really, really, really complicated and also like very meaningful, right, for basic science research. And I'd also maybe mention, like, this in the paper, the distribution, you can actually see what SACO does. So a lot of animals, a lot of animals. And then very surprisingly few maps. I'm like, maybe there should be more maps. I'll say Hugging Face has been doing a lot here and other companies.
Starting point is 00:19:36 Yeah, this is actually one thing. Something we get asked a lot is like, what's the minimum amount of data I need to fine-tune? And, you know, being able to do that with just sort of 10 data points will hopefully unlock a lot more than we can do ourselves. Yeah, I mean, the more the merrier, obviously. This is where ablations are really helpful. You probably didn't have any fine-tuning ablations in here. I think this is all data and model training oriented. But yeah, I mean, like very very clear.
Starting point is 00:20:02 I just have a cheeky, curious point. Is there a ratio, what is the ratio of negative examples to positive examples, right? So in Nikhila's example, when you were demoing just now, you only selected positive examples. Obviously, there's going to be a lot more negative examples of not class than positive examples of class. So there should be some exchange ratio where like negative examples contribute less than a positive example, or is that not the case? For positive and negative examples, I don't know that I have seen like a golden ratio that works well or doesn't work well, but I can offer anecdotally that a single negative
Starting point is 00:20:38 example goes a long way. A common place where fine-tuning is really helpful is like data that's out of distribution that might have possibly been in distribution. Like one of my favorite fine-tuning examples is like counting Waymos. There's not that much data that has Waymos labeled throughout the streets of San Francisco, but SAM does a really good job to identify a Waymo as like a vehicle. If you prompt with Waymo, it doesn't find anything; if you prompt with vehicle, it labels a Waymo as a vehicle, which is valid, but a Waymo is a specific type of vehicle, right?
Starting point is 00:21:08 Usually, from even just like a 10-second video clip, you can actually start to have SAM 3 learn what should have been seen as a Waymo versus what should have been seen as a vehicle. And even on a single image example, we see that like SAM 3 starts to adapt because it takes the text and image prompt into account when it makes a subsequent inference. From like three to five negative examples
Starting point is 00:21:30 alongside positive examples, you start to see the model update its priors, if you will, for where it would predict things from what the user provided. All this is riddled with caveats, right? Because like when you talk about the visual world, the negative examples and the positive examples could have been a very different perspective or a very different type of object.
Starting point is 00:21:47 Like maybe you're like labeling dog breeds and suddenly a new dog breed appears or maybe you have a perspective where it's overhead and then suddenly you have a side-by-side view. So usually the best way is to like have these things meet the real world data and try. But I'll offer maybe the note that a small number of negative examples goes a really long way, like small like three to five, not like hundreds. Yeah, the other place where negatives play a big role is just, is it in the image or not. And that was one of the things that we did, was really separate the problem into a recognition problem and a localization problem. So first, can you answer the question, is this object or
Starting point is 00:22:26 is this concept in the image? And then if it's in the image, where is it in the image? And so to really build in that capability, we had to annotate a lot of negative phrases in images. So basically a lot of phrases that don't exist in the image in addition to the concepts that exist in the image with the corresponding mask pair. So we have, you know, if you look at one of the tables in the paper, which shows the training dataset distribution, I think it's table 24. We have about 70, more than 70% of the annotations are these like negative phrases that are not present in the image. So we have to really train the model to not detect stuff that is not in the image. Yeah, I think that the separation of localization and,
Starting point is 00:23:21 it's basically precision recall, right, but in the vision domain. We basically add this presence token to the model, which explicitly separates the task of recognition and localization. So basically it simplifies the task. And so the model doesn't have to try to do everything with just the proposals in the detector; it's able to have this global, like, sort of learned token just for the recognition part. Yeah. In general, I find that you guys did a lot of extra net new work.
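The presence token idea described above can be sketched in a few lines: one global score answers "is the concept in the image at all?", and each proposal only has to answer "is this the right place?". In the sketch below, multiplying the two probabilities is an assumption about how they could be combined, and `score_proposals`, the logits, and the threshold are illustrative values rather than anything taken from the SAM 3 code.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score_proposals(
    presence_logit: float,
    proposal_logits: List[float],
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of proposals kept after gating by a global presence score.

    presence_logit: one image-level logit for "is the concept present?"
    proposal_logits: one localization logit per candidate box or mask
    """
    presence = sigmoid(presence_logit)  # recognition: global, one per prompt
    final = [presence * sigmoid(logit) for logit in proposal_logits]  # localization
    return [i for i, score in enumerate(final) if score >= threshold]


# Concept is clearly present: confident proposals survive.
present = score_proposals(4.0, [3.0, -2.0, 1.5])   # -> [0, 2]

# Same proposals, but the presence score says the concept is absent:
# nothing is returned, which is the behavior negative phrases train for.
absent = score_proposals(-4.0, [3.0, -2.0, 1.5])   # -> []

print(present, absent)
```

Gating this way means a single confident "not present" decision suppresses every proposal at once, instead of requiring each proposal to learn independently that the phrase does not match the image.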
Starting point is 00:24:06 You had a really nice chart in here about like the yellow boxes being like the new stuff. I forget where. Yeah, the architecture diagram. Yeah. I'm like, holy crap. Last time it was like, you know, there was like the memory stuff. This is SAM 2.
Starting point is 00:24:23 And here it is all this. Obviously, you know, it's hard to cover it all, but you know, I wonder if there's any other interesting stories or tricks like the presence token that you might want to focus on. Yeah, I mean,
Starting point is 00:24:39 this is nice, this diagram, I'm glad you brought it up, because SAM 3 isn't just a version bump. It's, you know, an entirely new approach to do segmentation. It's like this new interface for segmentation, and it combines so many different tasks where previously you would have needed a task-specific model for each of these tasks. You know, interactive segmentation, text prompting, open vocabulary detection, tracking, like all of these tasks you would have needed a separate model. So we really had to do a lot of work to bring it together.
Starting point is 00:25:15 I think one of the things we did was really decouple the detection component and the tracking components. So you can see, you know, we still preserve the tracking components from SAM 2. But the detector is separate. And the reason we do this is, if you think about what a detector has to do and what the tracker has to do, the detector needs to be identity agnostic. So if you have a concept dog, it needs to be able to find all instances of that dog, and it needs to sort of have this representation of dog that is the same for all dogs.
Starting point is 00:25:53 But when you're tracking those dogs through the video, each dog needs to have a separate representation such that we're able to preserve the identities. And so there's this kind of task conflict that emerges between the detector and the tracker. And so we really had to, you know, we experimented a lot, we really tried to build kind of a unified approach to do things, but then what we found was having the separate detector and tracker really worked. But we used the Perception Encoder as this shared visual backbone. And it's sort of a text-and-image-aligned encoder.
Starting point is 00:26:30 You can see the green boxes there, it says from PE. That's Perception Encoder. That was also from our group in FAIR at the time. This was released earlier this year in April. And so this really is bringing together components from like the entire FAIR and Meta ecosystem. We have Perception Encoder. We have a DETR detector.
Starting point is 00:26:54 We use SAM 2. We also use Llama in our data engine. So we're really, like, using all the components from. Yeah, it's like any third film in the trilogy. Like you always see like the previous recurring characters come back. Yeah, well, if it works, you've got to continue using it. And to connect to something we just discussed earlier,
Starting point is 00:27:14 you mentioned that in the video component, each object needs to be tracked independently. That's why the compute scales linearly with the number of classes, right? Because each of those instance types needs to be maintained. It scales with the number of detected objects. Yeah. So, for example, like each dog that appears in the video, each one of those needs to be tracked independently.
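The detector/tracker split described above can be illustrated with a toy sketch. This is not SAM 3's tracker (which, per the discussion, reuses SAM 2's memory-based tracking); `ToyTracker`, `Track`, and the greedy IoU matching are hypothetical stand-ins chosen only to show the shape of the idea: detections carry no identity, the tracker keeps one state per object, and tracker state therefore grows with the number of detected objects rather than the number of classes.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Track:
    """Per-object state: the tracker keeps one of these for every identity."""
    track_id: int
    history: list = field(default_factory=list)  # boxes seen so far


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class ToyTracker:
    """Hypothetical tracker: assigns identities to identity-agnostic boxes."""

    def __init__(self, match_threshold=0.3):
        self.tracks = []
        self.match_threshold = match_threshold
        self._ids = count()

    def update(self, detections):
        """Associate this frame's boxes with tracks; return the track count."""
        unmatched = list(detections)
        for track in self.tracks:  # one unit of work per tracked object
            if not unmatched:
                break
            best = max(unmatched, key=lambda box: iou(track.history[-1], box))
            if iou(track.history[-1], best) >= self.match_threshold:
                track.history.append(best)
                unmatched.remove(best)
        for box in unmatched:  # anything left over starts a new identity
            self.tracks.append(Track(next(self._ids), [box]))
        return len(self.tracks)
```

Feeding two "dog" boxes per frame yields two tracks with separate histories, even though the detector side treats both boxes as the same concept.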
Starting point is 00:27:33 There was something else that you started to allude to in the paper that I was hoping we would spend some time discussing, and it's the interaction of SAM 3 and LLMs, Llama and others. So using SAM 3 to almost be like a tool call for LLMs to give them better grounding and give them better visual understanding. And there's a table in the paper where you describe the increase in performance. It's kind of alluding, I think, to maybe where things are going for using SAM 3 as a component part of multimodal architectures.
Starting point is 00:28:01 Do you want to describe a bit about what the introduction of that work was meaning to showcase and how the interaction of SAM 3 and LLMs is envisioned to be important. Yeah, maybe I can just do a quick intro and I'll hand over to Pengchuan to do the deep dive. But essentially, as I mentioned, with SAM 3, we constrain the text input to these atomic visual concepts like yellow school bus or yellow watering can. But obviously, people want to interact with the model with natural language and we want to enable that as well. And so that really segues into being able to use SAM 3 as this visual agent for an MLLM. And so I'll hand over to Pengchuan. Maybe you can explain about the SAM 3 agent setup and then talk through some of the results that we got there.
Starting point is 00:28:52 Yeah, yeah. So as Nikhila mentioned, the big picture is that SAM 3 is focused on this kind of atomic concept. But people definitely want to try much more complex phrases, like, okay, locate the bigger character for me. For example, in this example, what is the feature that distinguishes male and female in this picture? These are more complex language queries. This is exactly what SAM 3 cannot do, but what the SAM 3 agent targets to solve. In this case, you can see that it needs much more advanced
Starting point is 00:29:34 language understanding and reasoning. SAM 3 currently does not have this kind of capability because it has a small language encoder. But we know that large language models are definitely trained on a lot of this data and
Starting point is 00:29:50 have this kind of world knowledge and reasoning capability. The SAM 3 agent is exactly using SAM 3 as an eye for the large language models to solve this kind of complex visual grounding task.
Starting point is 00:30:07 Is there any sort of insights that you or surprises that you have other than, I guess, like, Sam is a very good tool? Is that the main conclusion? Go to Table 8 in the paper as you described this, if you don't mind. Table 8. Okay. Yeah. Yeah.
Starting point is 00:30:27 Good to go. Yeah, please. Maybe I can quickly reply to swyx's question. I would say that first, besides SAM 3 being a really good tool that provides the eye for the large language model, the other thing we definitely found is that SAM 3 is not perfect. It's not as robust as the human eye. So the language model also helps to correct SAM 3's errors. They have a synergy with each other, instead of just, okay, the large language model provides the brain and SAM 3 provides the eye.
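A minimal sketch of that "brain plus eye" loop, under loud assumptions: `run_agent`, `propose_phrase`, `segment`, and `accept` are hypothetical names, not the SAM 3 Agent API. The language model side turns a complex query into a short concept phrase, the segmenter returns candidate masks, and the language model side then filters those masks and retries with feedback when nothing survives, which is the error-correcting synergy described above.

```python
from typing import Callable, List, Optional, Tuple


def run_agent(
    query: str,
    propose_phrase: Callable[[str, Optional[str]], str],  # stand-in for the MLLM
    segment: Callable[[str], List[str]],                   # stand-in for the segmenter
    accept: Callable[[str, str], bool],                    # MLLM check of one mask
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Return (phrase used, accepted masks), or (None, []) if every round fails."""
    feedback = None
    for _ in range(max_rounds):
        # 1. The "brain" rewrites the complex query as an atomic concept phrase.
        phrase = propose_phrase(query, feedback)
        # 2. The "eye" returns every candidate mask for that phrase.
        masks = segment(phrase)
        # 3. The "brain" inspects the masks and drops the ones it rejects.
        kept = [mask for mask in masks if accept(query, mask)]
        if kept:
            return phrase, kept
        # 4. Nothing usable: report back so the next phrase can differ.
        feedback = f"'{phrase}' returned no acceptable masks"
    return None, []
```

Masks are plain strings here only to keep the sketch self-contained; in practice they would be whatever mask representation the segmenter returns.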
Starting point is 00:31:02 Interestingly, you use Llama 4. I saw there's a mix of Llama 3 and Llama 4 here. But it looks like it does best with Gemini 2.5, which makes sense given this comparable set of MLLMs. I'm just like, I think like the baseline also is just that like, well, what extra addition does this add on top of just the MLLM? I would maybe like want to see that ablation. Maybe you've already done it somewhere. What do you mean by additional thing? So it's basically like without a tool call, there's some native capability
Starting point is 00:31:32 inside the MLLM itself. Wow. In fact, that's a really good question. In fact, our reviewers even asked that question. So you can imagine that without large language models, without a VLM, with SAM 3 only, for ReasonSeg,
Starting point is 00:31:54 it only achieved, on the validation set, if I remember correctly, about 30, numbers around there. And also, it's very intuitive. You can see that ReasonSeg has these
Starting point is 00:32:09 short and long subsets. It has different subsets, short and long. Short is very close to the SAM 3 training data, like atomic phrases, short phrases. Long is this kind of very complex reasoning. You will see that
Starting point is 00:32:25 for short, SAM 3 only is very close to the SAM 3 agent. But for long, the gap is so large, which indicates that, okay, that is exactly the capability that the large language model is bringing. Got it. I can show an example here that might be insightful too.
Starting point is 00:32:43 Go for it. So even comparing like Sam 3 and Gemini, let's say that we just want to have them do like an object detection task here of finding here we're going to prompt with a speedometer and RPMs, and we're going to ask for things like indicator light, number, and needle. And if we run Sam 3 head to head with Gemini 3 and Florence 2, almost as a baseline of where things have been, and we see each of the results. First things first, you'll note that the speed of inference of Sam 3 is quite quick.
Starting point is 00:33:17 This is just calling the Gemini 3 Pro API, so whatever is provided from hosted compute is sort of what you get on the response time. And then the second thing you'll note, in addition to speed, is the accuracy of the results. We might have a timeout error, let's see. Do you have Elo scores? What scores?
Starting point is 00:33:38 Elo scores, like, Elo, yeah. Yeah, you had the arena. Okay. I was wondering what the Elo was, because you said you were blind testing this. Yeah, that's actually interesting because we had blind tested SAM 3
Starting point is 00:33:52 before it was released, not as SAM 3, just for people to try and compare. I think we called it like a potential SAG or SEG preview or something. And we allowed users to vote and they kind of unanimously voted for what they didn't know at the time was SAM 3. We actually got like emails of people being like, hey, like where can I use that? And we just sort of ignored them until the model came out. But so here with the responses, you see that the grounding capabilities of SAM and SAM 3 compared
Starting point is 00:34:27 to even Gemini are out ahead currently. So not only is it doing grounding, but if you look closely, you can actually see it's making segmentation masks too, whereas Gemini 3 struggles, it just does detection by comparison. And then the other thing is just the richness of detections, like the recall is high as well as the precision.
Starting point is 00:34:50 And if we compare here, it does it almost as well, right? But you see that it misses some of the numbers and has some of these erroneous boxes that it has predicted. And then it also doesn't do segmentation. So it just does detection of the task. So you can envision that the same way the SAM 3 paper introduces the idea of using SAM 3 in tandem with MLLMs. I would expect that to be the case pretty soon. And maybe the Google team is taking some notes to improve Gemini and other series of models based on what SAM 3 demonstrates here. So in other words, not only is it faster,
Starting point is 00:35:26 but it seems to be more comprehensive for concept segmentation. And I think the speed actually is a huge factor for many use cases. I think even at Meta we're using SAM 3 for various different product use cases and fast inference speed is very critical to enable that. And so I think in many cases you don't even need an MLLM. It's kind of overkill to use an MLLM for some applications. The other interesting thing is the Florence 2 results. And Florence 2 is a little bit older of a model now,
Starting point is 00:36:05 so maybe it's not fair to put up head-to-head with the state of the art. But it is useful as a way to just see how far we've come. Because Florence 2, by comparison, labels the entire region as a single class without seeing individual detection of numbers and indicator lights and needle. And not only that, but it actually takes about three times as long as SAM 3. So SAM 3, again, is faster, doing a task that the other models are not doing in segmentation,
Starting point is 00:36:32 and more accurate, both in recall and precision of the things that it's intended to find, which I think really showcases the capabilities of the model. In fact, I was even a little surprised about this, because this domain is more OCR-like, because recognizing numbers is nearly OCR. We did not prioritize this domain in our data collection. It works. So we know that it roughly works, but I think I got surprised that it works so well.
Starting point is 00:37:03 That's encouraging. Even on a task that wasn't expressly prioritized, it still does a great job. Yeah. In fact, in our data engine, we intentionally do not sample OCR-heavy images. Wow. On an easier one,
Starting point is 00:37:19 glass mug, SAM 3, Gemini 3, Florence 2. SAM 3 loaded first and, really impressively, it sees even this glass mug in the corner, which I think is something SAM 3 does a great job of, occlusion and partial objects. Gemini 3 struggles a bit with this one, I think maybe because of the opacity of the objects by comparison. And then Florence 2 does a good job at finding one of the glass mugs. So again, another type of task that shows the power and feracity of the model. Yeah, I mean, exhaustivity, like finding every instance, is something we heavily prioritized and is really built into the data engine design. You know, Pengchuan, do you want to talk about how we designed the data engine to really scale exhaustivity?
Starting point is 00:38:11 Because if a human was to annotate every single instance, it would take a really long time to annotate and verify. But we put a lot of effort into trying to automate and speed up that process, such that we could get to the data scale and diversity needed to get to a step change. Yeah, yeah. I think definitely, I would say the data engine is the critical component for achieving SAM 3's current performance. So maybe we can go to the data engine picture. I think we have an illustration there.
Starting point is 00:38:48 Yeah, page five. Here. You can see that this is our annotation pipeline. So we first source the images and then generate the noun phrases. So this is the input of this task. We source images and generate noun phrases from, for example, a Llama-generated caption, and we parse the caption to get the noun phrases. This is the input distribution.
Starting point is 00:39:12 Then we use a SAM 3 model in the loop to generate candidate masks. These are the candidates, but they're not perfect, especially in the beginning. Then the next step is verification. So, given these masks, we need to first do mask verification to verify whether each mask is good or not. And then, after we filter out all the bad masks, there are some good masks left,
Starting point is 00:39:43 and we verify whether these good masks are exhaustive or not, like your mug example. So for example, if the model does not predict that partial mug, then the exhaustivity check will fail there.
Starting point is 00:39:58 Then if the exhaustivity check fails, we go to the next step. You can see in the pipeline that we go to this so-called human manual correction.
Starting point is 00:40:12 Humans manually annotate all the missing masks to make this data point exhaustive. So you can see that exhaustivity is a very big factor there, and we place it at the center of this data engine. But you can see that if we ask human annotators to annotate every mask from scratch, it will take a lot of time. I remember each data point in the beginning took more than two minutes to finish.
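The four steps just described (propose candidate masks, verify each mask, check exhaustivity, fall back to manual correction) can be sketched as a single routing function. This is an illustrative sketch, not the team's pipeline code: `annotate_data_point` and every callable it takes are hypothetical names, and the verifier callables could equally be human annotators or models.

```python
from typing import Callable, List, Tuple


def annotate_data_point(
    image: str,
    phrase: str,
    propose_masks: Callable[[str, str], List[str]],       # model-in-the-loop proposals
    mask_is_good: Callable[[str, str, str], bool],        # mask-quality verification
    is_exhaustive: Callable[[str, str, List[str]], bool], # exhaustivity verification
    human_fix: Callable[[str, str, List[str]], List[str]],# manual correction step
) -> Tuple[List[str], bool]:
    """Return (final masks, needed_human) for one (image, noun phrase) pair."""
    # Step 1: the model proposes candidate masks for the noun phrase.
    candidates = propose_masks(image, phrase)
    # Step 2: mask verification keeps only the good candidates.
    good = [m for m in candidates if mask_is_good(image, phrase, m)]
    # Step 3: exhaustivity check; if every instance is covered, we are done.
    if is_exhaustive(image, phrase, good):
        return good, False
    # Step 4: otherwise a human adds the missing masks.
    return human_fix(image, phrase, good), True
```

Returning the `needed_human` flag alongside the masks is a design choice made for this sketch: it records which data points the model could not make exhaustive on its own, which is the kind of difficulty metadata discussed later in the conversation.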
Starting point is 00:40:40 But if you use a model in the loop, then it's reduced to about 45 seconds. You can use the model to propose masks and then just use humans to annotate the missing masks. Then it's 45 seconds. Another very key innovation in this data engine is that we really found that these verification steps, like verifying whether a mask is good or not, or verifying whether the good masks are exhaustive or not, can be done by AI, can be done by a multimodal model. That is a breakthrough. When we fine-tune, for example, Llama 3.2 with our
Starting point is 00:41:22 human-annotated verification data, we get superhuman performance on these two verification tasks, and then we do not need humans on these two tasks. This further reduces our per-data-point annotation time to about 25 seconds. So you can see that from the original all-human approach at about two minutes to finally 25 seconds for one data point, this is the journey of our data engine to make it super efficient. Did you maintain statistics on how many images were specifically hard? For example, like we had n many objects that were very difficult occluded,
Starting point is 00:42:05 or we had some number of images where the comprehensive test was really hard, or did you just bet that by having a large scale, you would encompass occlusion and exhaustive cases? In fact, we do maintain this kind of exhaustivity information, which one is hard, which one is easy. First, in our data engine, when humans annotate, we know exactly which data points were made exhaustive by the model and for which ones we needed humans to intervene. In fact, we have that metadata in our dataset. The second, more beautiful part is that we have this exhaustivity AI annotator. Given a new data point, we can automatically decide whether this is a difficult data point
Starting point is 00:42:52 or an easy data point using this AI annotator. Yeah, I think the sort of bootstrapping and annotation story was very strong last time around, and it's even stronger this time. What are you going to do when you run out of humans? Like, you know, next year you're going to have superhuman-level everything, right? Like PCS and PVS.
Starting point is 00:43:15 What then? I'm not so optimistic about this. First, indeed, our current plan for the next project is this kind of fully automated data engine. Without humans, that's our dream. I would say that
Starting point is 00:43:37 that is the perfect thing, but we still need some useful information. There's no free lunch. There's something no model can do well, and we need humans to inject that useful information. I would say that what we can practically do is really minimal human intervention. Humans only do the tasks that the model cannot do,
Starting point is 00:43:58 the most difficult tasks. So that's the first one, the internal data engine. The second one is about human performance on this PCS task. My feeling is that computer vision is going to enter this: when we get
Starting point is 00:44:15 to human performance, we will enter this RLHF domain of computer vision. So you can see that for language models, before, in the BERT age, when language models were not at human performance, SFT, really imitation learning, really did the job and got to very good performance. But if you only do SFT and the SFT data is annotated by humans, then your performance is probably bounded by humans. You cannot get superhuman performance just by this data engine approach of using humans to
Starting point is 00:44:46 annotate the data and then fine-tune on that. You need to go to this RLHF domain where humans really just tell which one is better. This is exactly the philosophy: to tell which one is better is easier than to construct the data point from scratch.
Starting point is 00:45:05 So you can get higher performance. You can get better performance than a human drawing from scratch. I hope that after SAM 3, we can see new research emerge in computer vision on, okay, how we go beyond human performance. SAM 3 is close to that, but I would say that a new learning paradigm is needed to go beyond
Starting point is 00:45:29 human performance for SAM 3 tasks and for computer vision. Yeah, now, just to add to that, Pengchuan is only talking about images. I think video is a whole other challenging beast, and getting to that really automated data engine is something that we tried to do in SAM 2. We actually didn't get to that fully automated approach. In SAM 1, we did. The SA-1B dataset that we released was fully annotated automatically. We didn't really get to that in SAM 2 for video.
Starting point is 00:46:01 And in SAM 3 for video, I think there's still like a lot of room to push on this sort of pseudo labeling for video and really be able to get to that same step change as we had on images. What are the biggest changes needed to see the same step change in video that you've seen in images for automated data pipelines? Yeah, yeah. I would say a really good video large language model, a video multimodal model. So when we did SAM 3, earlier this year or last year, you can see that image multimodal models were very good, but video multimodal models, I think, really became good or practical
Starting point is 00:46:45 later this year, like Qwen 3, this kind of model gets roughly okay at that stage. So we have a good base model to fine-tune on our data and to get human performance for this recognition or verification task. I would say that we definitely need SAM 3-like efforts on the perception side, but we also need these multimodal large language model efforts, a good foundation model on the vision-language side. I think it's ready. It's ready now.
Starting point is 00:47:14 Yeah, also video annotation is just so much more time intensive, to be able to annotate enough data to train a verifier, like video mask annotation. We just found it was like very time intensive. So maybe there are more efficient video annotation strategies. I think there's, you know, a lot of exploration that could be done there too. Yeah, you know, spending a bit of time on video. I wanted to also talk about, you know, obviously, last time we were focused a lot on memory attention. I think this time there was this sort of masklet thing that I wanted to just get more ideas on, or just discuss the idea generally.
Starting point is 00:47:54 What was it called? The masklet detection? Masklet detection score, exactly. And how it's basically smoothing within a temporal window, which I think basically, you know, a lot of computer vision models don't have this, and they could just simply add it and it would be a lot more stable when it comes to video.
Starting point is 00:48:12 And I don't know why they don't do it. Maybe I can comment on this. First, why didn't they do that? I think one big reason is this streaming requirement. You can see that when you want to gather information across the entire masklet, then you need to wait for the masklet
Starting point is 00:48:33 and then apply this strategy. So that will sacrifice some streaming capability. So you can see that the streaming requirement somehow limits traditional methods from doing this. But I would say that this is definitely beneficial. The reason why is that I think even humans do this. You can imagine that when something just appears at the corner of the video, like a hand appears at the corner of the video, you just do not know whether this is a man or a woman. So humans might even make mistakes.
Starting point is 00:49:08 Also, SAM 3 will make this mistake. But when you get more and more information, when the person really enters the video fully, then you get to know, okay, whether this is a man or a woman. So gathering more information to really nail whether this concept is the concept you query is the idea here.
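The "gather more information before deciding" idea can be shown with a small sketch. This is not the paper's definition of the masklet detection score; `confirm_masklet`, the plain mean, and the 0.5 threshold are assumptions picked for illustration. It averages a masklet's per-frame detection scores over its first `window` frames and only then decides whether the masklet matches the queried concept, so a larger window means a later but more robust decision.

```python
from typing import List, Optional, Tuple


def confirm_masklet(
    frame_scores: List[float], window: int, threshold: float = 0.5
) -> Optional[Tuple[int, bool]]:
    """Decide whether to keep a masklet once `window` frames have been seen.

    Returns (index of the frame at which the decision is made, keep?),
    or None while fewer than `window` frames are available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(frame_scores) < window:
        return None  # still waiting: this wait is the streaming latency cost
    mean_score = sum(frame_scores[:window]) / window
    return window - 1, mean_score >= threshold


# A person entering from the corner: the first frame is ambiguous,
# later frames are confident.
scores = [0.2, 0.7, 0.9, 0.9]
streaming = confirm_masklet(scores, window=1)  # decides on frame 0 only
delayed = confirm_masklet(scores, window=4)    # waits for four frames
```

With these example scores the single-frame decision drops the masklet while the four-frame decision keeps it, at the cost of waiting three extra frames.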
Starting point is 00:49:30 So there is a trade-off between latency and accuracy here. If you care more about accuracy, then you can use this overall information across the masklet to get a more robust signal about the concept. But if you care about latency, then you need to make a decision in the very beginning,
Starting point is 00:49:55 and then you will sacrifice some accuracy. I think also in many video use cases, like what you were sharing on Roboflow, users care more about detecting the objects rather than having unique identities. So in some cases, it isn't required to preserve the identities throughout the video and you just want to essentially do detection per frame. Like for the Roboflow Rapid examples you were sharing. Yeah, there's cases where being able to count and you know the objects are all going to be the same, so you don't care as much about unique classes. You just want to know that.
Starting point is 00:50:33 the full presence, things like that matter. But then there's other cases like you mentioned where, I don't know, like in sport, you care about individual players versus just knowing that there's 11 players on the pitch. One thing that might be useful actually to discuss with some of our time is we talked a little bit about how SAM 3 and MLLMs will play nicely together. But there's probably like a greater discussion about how SAM 3 fits into a broader AI ecosystem and like what bigger picture trends it might fit into. Do you have some thoughts on what
Starting point is 00:51:03 this represents about where things are headed? Maybe I could say one point and then Pengchuan, feel free to add. As we mentioned before, SAM 3 isn't just a version bump. We are really having a unified model that can do many different tasks in the same unified architecture.
Starting point is 00:51:24 And so, you know, in the same way that LLMs can do many different tasks without needing a task-specific model, like with Sam 3, We're able to do image, promptable concept segmentation, video, promptable concept segmentation. We don't need a specialist model for counting. We can do interactivity.
Starting point is 00:51:44 There really are multi-capability visual models that are on par or better than the single-task state-of-the-art models. So that's really one place in which SAM 3 fits into the AI ecosystem. In terms of MLLMs, I don't know if, Pengchuan, you want to talk about the agent approach? Yeah, yeah, definitely.
Starting point is 00:52:10 I would say that SAM 3 now really makes a big step change in vision, and how it really fits into the general AGI or frontier model landscape is very, very exciting for me. We always have this example: give it this six-finger hand picture and ask, how many fingers do we have in this picture? And then you can imagine that with SAM 3,
Starting point is 00:52:37 we can just first detect how many fingers we have there very robustly, six fingers, and then the multimodal model should know that, okay, this is a six-finger hand instead of a five-finger one. You can see that the errors made by frontier models can be solved if we use SAM 3 as a tool. But then, is SAM 3 as a tool the end of the picture, or should SAM 3 somehow just be
Starting point is 00:53:06 natively embedded into these frontier models, so that the frontier models learn the SAM 3 capability by themselves? I would say that there are a lot of possibilities there. My picture is that now we have a very good brain with these frontier models. And we have a very good eye with SAM 3. Now let's see whether the eye really works natively together with the brain, or is really a different organ that needs to somehow work with the brain like a tool.
Starting point is 00:53:41 I think this is a very exciting research area. And so in your analogy, if you think about like the visual cortex compared to like a human brain, like, you know, we have rods and cones in our eyes that do kind of very fast, we joke like lizard brain level detection, simple stuff. And then you have your brain that reasons about some of the visual information that your eyes see. In your example of SAM 3 as a tool call or SAM 3 as natively a part of the multimodal models, which future do you think is more likely?
Starting point is 00:54:18 I think I want to bet on them working natively together in the future, I would say for simple or even intermediate-difficulty vision tasks. For example, counting with fewer than 20 objects. I think this kind of simple task is like system one visual reasoning in our brain. The brain should do it by itself. But with very, very difficult tasks, you can see that if we are counting
Starting point is 00:54:43 maybe thousands of objects in a picture that is so crowded, then we even need to draw something there. I would say that at that time, maybe we need some expert model for difficult tasks. You can see that this is a hybrid approach, but I'm more excited, and I think most of the cases should be native. The reason why is, I would say perception or grounding, really knowing where it is and how many there are, is like a fundamental capability
Starting point is 00:55:16 of our brain. I'm just not happy that the frontier model cannot count how many fingers immediately, and instead needs to call a tool to do that. I think this should be a system one thing, and this should be native in our brain. And also, if our brain cannot do this task, it means that it's definitely missing some very critical visual capability by itself. So my intuition just feels that it's not correct to not have this capability by itself. So for very simple system one questions, things like how many fingers on a hand, that should be native. Maybe more complex things that are maybe long-running tasks and long-running reasoning,
Starting point is 00:56:03 then maybe there's a bit more of like a tool-call approach. Yeah, yeah. Exactly. For example, you can see that in our SAM 3 agent or in our AI annotator, we even demonstrate this approach. For simple cases, the model can do it by itself: okay, I can detect, for example, 10 people here. And then the large language model, the AI annotator, can even know that, okay,
Starting point is 00:56:27 these 10 people are not exhaustive. Okay, there are more people there. So if you want to do well, then maybe you need to do more steps, for example, calling an expert model. So you can see that this is a very native reasoning process for more advanced or complicated vision questions. I have a related but maybe slightly different question. SAM 3 is an incredibly powerful piece of work. And it's open source as a part of, now, MSL. Is open source critical to achieving AGI? Maybe I can comment on SAM specifically,
Starting point is 00:57:06 but in SAM 3, we did leverage many of the open source contributions people have made on top of SAM2. There were new data sets, there were new benchmarks, there were new kind of inference time optimizations. We adopt a lot of the things that the community built on top of the models, on top of the data sets. And so all those contributions helped make Sam 3. For Sam series, we've really benefited a lot from, you know,
Starting point is 00:57:38 being very generous with what we open source and then leveraging what the community builds on top of that. But that's just from the SAM perspective. I think it's clear what the community brings and offers. And I think, you know, every time we do this, we always shout out to the community to, like, you know, try it on their use cases and report, like, weird findings and like, you know, if it doesn't do what you are trying to make it do, well, let's talk about it, right? And then maybe sort of implement it in the next version. Like you already said,
Starting point is 00:58:08 Pengchuan, you already hinted at that, like what might be coming for SAM 4, which is at least a little bit more of the document and OCR work. Any other directions that are interesting? I guess obviously a lot more video work as well. What is the talk of the town in like the CV community that like, you know, it would be really great or like super obvious. Like next year is going to be the year of what? Yeah. Maybe I can first say something and then you can add. First, definitely, I think even if it's not SAM 4, it's SAM 3-something, SAM 3-point-something, like small models. SAM 3 currently only has one model, one model size. We can have a
Starting point is 00:58:44 more efficient model that fits edge use cases, and also a more efficient model for video. I think currently the video model is not efficient. You can achieve very good throughput, but you need GPUs to do that. So first, small and efficient models, that's one big thing. The second big thing is definitely video. Roboflow can do that for you. Yeah. The second thing is video.
Starting point is 00:59:11 I would say that video is still far from, I would say has a big gap from, human performance. Right now there's still a lot of research that needs to be done there, like how to do end-to-end training with video, which we do not have. We have this decoupled approach, but we do not train this model end to end, and we expect it will definitely benefit from end-to-end training. And also, on the video side, how to really scale up the data engine: we definitely need AI annotators for video. We tried that, but yeah, I think that's something definitely worthwhile to do.
Starting point is 00:59:53 The third one, we also discussed that: how perception fits into this big AI landscape. Now we have the eye; how does the eye work with the brain to solve real reasoning tasks? Not only output segmentation, but really answer how many cases are here, or even answer the question,
Starting point is 01:00:13 okay, I have an example from biology labs. The robots need to decide whether the liquid in the test tube is at the correct level or not. You can see that this involves perception, but also involves reasoning. How to solve this more visual reasoning task with
Starting point is 01:00:33 SAM is a very big direction. On the robotics topic, it was exciting to hear from, like, several friends that work at, you know, different robotics companies on how they're, like, immediately starting to use SAM 3. And I think especially for the video use case, robotics is probably one of the domains where improving video performance will have a lot of impact. And so I think, yeah, that's definitely an area that we could improve on further. But, yeah, to Pengchuan's point, I think there's still another step change to be achieved on video PCS.
Starting point is 01:01:07 Yeah, just a quick comment on the robotics things. We're interviewing a bunch of robotics folks here, as well as, like, Fei-Fei, who obviously started ImageNet. A lot of people are betting on explicit world models, and SAM is not, for better or worse. And I wonder when that crossover might happen. There's an open question if you guys want to take any world models discussions, re where things are going based on, like, community questions.
Starting point is 01:01:33 Similar to how Nikhila mentioned after SAM 1, the, like, almost obvious thing that people wanted was, like, open concept prompting, because people are like, great, this model can see things, but I want to tell it what I want it to see. And now with the introduction of SAM 3, you have this stepwise component, which feels like a key component of, you know, the ChatGPT era for vision arriving as a result. What's going to happen is now you've provided people with an open text box and media. And so you're going to get all sorts of queries from people that maybe the model isn't primed to be able to perform particularly well on yet. For example, earlier we were talking about document understanding and document reasoning being a place where there's
Starting point is 01:02:13 known improvements to be made. And so you'll have people that will probably prompt to try to OCR things, or you'll have people that want to do work with spatial reasoning, like, give me the object to the left of this other object, or give me a sense of where things are in relation to one another, which is critical for robotics like we're discussing, because that's how you navigate throughout the real world. You'll also have, I think, people who want action recognition and vision-language-action models, VLAs. Like, the same thing, where you have these tasks where people are used to providing open text prompts and getting, here's the part of the scene where the player kicked the ball or the tennis player made the serve. Those are interesting for the purposes of how to
Starting point is 01:02:51 understand and synthesize visual inputs. And so now that you've kind of given this open text box for media, there's going to be a flood of the types of things users are going to want to try to do, some of which SAM is already going to be really well adapted to do, some of which not. And I think that's going to reveal itself in the types of things that are obvious. One of the things that we wanted to discuss was, like, where to use SAM and discover how to build with SAM. So in addition to the Meta team building a tremendous playground for being able to interact with images and video and kind of apply effects for, like, a video emphasis, I think one of the things that we're pretty excited about with SAM 3 is how much
Starting point is 01:03:30 it positively impacts each part of building a system for visual understanding. So for example, the very first step, of historically aggregating and collecting a data set because you think that there's not a model that understands the slice of the world that you want to understand, is where automating away lots of labeling can exist. Basically, if you collected a bunch of data of something that is already in SAM 3's knowledge, then you can prompt SAM 3 to automatically label all that data for you. And so we've actually made a bet on SAM 3 being a core part of Auto Label at Roboflow, giving users a first pass of saying, hey, if you have a new image or you have a new video,
Starting point is 01:04:09 start providing just a text prompt and allow SAM 3 to find and automatically label those regions of interest for you. Downstream, I think there's areas for fine-tuning. Like, you know, within a week of releasing SAM 3, MedSAM3 came out for adapting SAM into medical contexts. And I think that's a harbinger of what's to come. Like, there will be lots of domain-specific adaptations of SAM in places where maybe there's a specific ontology that someone wants to understand, or maybe there's a place where the model just doesn't have great awareness yet.
Starting point is 01:04:41 And I think we're already beginning to see that with hundreds of fine-tunes that users are creating for various domains. And then the last area is, like, okay, I've got my model, now I want to use it. And so one of the things that we're really proud of is to be ready on launch day to showcase the infrastructure we've built to burst and scale, like, infinitely large as folks have models that they want to deploy and make readily available. Having an endpoint that serves either a fine-tuned model or a model as is, or even a model that might be able to run on edge hardware as smaller models come out
Starting point is 01:05:09 or maybe distillation comes to rise, is, I think, also an awesome place where we're seeing SAM 3 being impactful at each part of, like, the computer vision lifecycle and pipeline. That's awesome. Yeah, I think especially the impact on speeding up annotation, I think we've seen that consistently on Roboflow. And I'm really curious to see how the introduction of SAM 3 really helps speed up that process even further. I mean, just from playing around with it, it's so much faster than having to manually annotate every single object. So, yeah, really curious to see how that improves the experience. One of the things that we were pretty excited about is we were kind of able to build an entirely new product in the world of SAM 3.
Starting point is 01:05:52 And we called it Rapid, but basically it's like, there's probably a model that already understands the objects in the world that you want to see. So here I'm screen sharing an example of, like, these are vehicles next to our office in San Francisco that go by. And you can see here's a Waymo and here's, like, other vehicles. And, like, if I just have, like, this 10-second clip and let's say, you know, the first thing I want to do maybe is just, like, count cars, and I want to get a sense of each of the vehicles. What's really awesome is I can just, you know, of course, text prompt and say I want vehicle. And as I toggle through different frames in my video, SAM 3 already recognizes and understands those objects. Now, one thing that I think is really interesting, there was a conversation earlier about how much you want to rely on a model
Starting point is 01:06:39 versus a human's view of the model's output for what you care about. So, for example, let's pretend in this scene, maybe the only cars that we care about are the ones that are, like, before the crosswalk and maybe not far in the distance. Then you'd get people that would say, hey, you know what, I actually want the objects that are, like, most confident. And I would, like, you know, move my slider down to, like, getting a fewer number of objects. Whereas maybe others might say, hey, I want, like, every single presence of a potential object in the scene, which even gets, like, reflections on the building of objects. As computer vision approaches this world where we increasingly have, like, models that can understand and improve themselves and we rely on what human output
Starting point is 01:07:16 and human preference from the models is, we're going to get these funny scenarios where things aren't all, like, immediately deterministic of what a human cares about. And I think that's where, like, tooling fills a big gap, but it also is going to be a place where it'll be really interesting to see where users kind of start to use and apply the models and why you need this last-mile work to put the model in context in the domain that someone is trying to solve and tackle. So let me, since you're here, right? This is one of those things where I'm like, I'm not sure the concept of labeling concepts can scale, only because I don't know if this slider between
Starting point is 01:07:59 less and more is the way, if ultimately I need to tell you whether or not to include reflections, right? Because with reflections, sometimes it's great. That's exactly what I want. Most of the time it's not going to be what I want. I don't know if some RLHF thing
Starting point is 01:08:15 is going to solve any of that because you just need more prompting. Just saying vehicle is not going to do it. Yeah, I don't know. Feel free to disagree. You can imagine such a pipeline coming. For example, as swyx said, maybe the reflection is exactly what I want;
Starting point is 01:08:36 then you need some kind of iterations with the interface or the model to finally get what you need. So you need to specify the concepts more clearly through multiple iterations. Can the human not be involved in this iteration, and models just do it automatically? I think that's something, definitely,
Starting point is 01:09:01 I would say it's quite interesting. You can imagine this workflow: I want reflections, and then with the default threshold, maybe the model will give an output. Then another very strong
Starting point is 01:09:16 perception model, like Gemini 3: we then ask Gemini 3 whether there's some reflection there. And it says yes. Then you can see that we can automatically
Starting point is 01:09:33 kind of move the threshold, kind of lower it, and we ask again to see whether the reflection is now included or not. So somehow this process can possibly, should be, done completely with AI, unless, yeah, yeah, exactly. So for now, the answer is image. And we can sort of tie it closer.
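The loop described here (run at a default threshold, ask a second model whether the wanted instances made it in, lower the threshold and ask again) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SAM 3's or Roboflow's actual API: `Detection`, `keep_above`, `tune_threshold`, and the `verifier` callback are hypothetical names, and the toy verifier stands in for a call to a multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Detection:
    label: str    # what the detection actually shows (toy stand-in for a mask)
    score: float  # model confidence in [0, 1]


def keep_above(dets: Sequence[Detection], threshold: float) -> List[Detection]:
    """Keep detections whose confidence is at or above the threshold."""
    return [d for d in dets if d.score >= threshold]


def tune_threshold(
    dets: Sequence[Detection],
    verifier: Callable[[List[Detection]], bool],
    thresholds: Sequence[float] = (0.5, 0.4, 0.3, 0.2),
) -> Tuple[Optional[float], List[Detection]]:
    """Try thresholds from strict to loose until the verifier is satisfied.

    `verifier` is a hypothetical callback; in the workflow described above
    it would be a multimodal model answering "are the reflections included?".
    """
    kept: List[Detection] = []
    for threshold in thresholds:
        kept = keep_above(dets, threshold)
        if verifier(kept):
            return threshold, kept
    # No threshold satisfied the verifier; return the loosest result tried.
    return None, kept


def reflections_included(kept: List[Detection]) -> bool:
    """Toy verifier: the user wants reflections, so check one was kept."""
    return any(d.label == "reflection" for d in kept)


detections = [
    Detection("car", 0.9),
    Detection("car", 0.7),
    Detection("reflection", 0.35),
]
chosen, kept = tune_threshold(detections, reflections_included)
print(chosen, len(kept))  # prints: 0.3 3
```

When no threshold satisfies the verifier, the sketch returns `None` for the threshold, which is the point where a human would be brought back into the loop.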
Starting point is 01:09:56 I think Joseph is showing us the sort of Waymo annotation. Yeah, it's nice. Now you have a Waymo model. Yeah, I was just doing an example where maybe we want to find an object that's not already represented in the training data. I think prompting could solve the problem of, like, reflections, because maybe you could say, like, vehicles on the street. But to your point, like, you would have to, like, see that that's a failure case, right? Like, if I was, like, just setting up a camera and saying count cars, I wouldn't anticipate
Starting point is 01:10:28 realizing that reflection could be a problem. And so I think this is why, like, in some ways, human in the loop matters, because identifying human intention, not necessarily human knowledge, is what's going to be important for a lot of last-mile use. But yeah, I'm pretty excited about it. Yeah, maybe I want to echo what Joseph said. Actually, it's also my experience:
Starting point is 01:10:51 different people have quite different definitions of even a visual concept. For example, for some data sets, even hand. Some people would like to annotate just the palm as the hand. And some people will include the arm also as hand. Then when we first tested SAM 3 on some very customized data sets, we found, okay, the performance is not that good. And when we finally looked into the performance,
Starting point is 01:11:20 we found, okay, this is just that the user has a different definition or interpretation of the concept, but both interpretations are okay. Then in this case, you can see that you really need a human in the loop to do the few-shot fine-tuning or to adapt to the user's definition of this concept. That's exactly right. It's not always, like, deterministic what someone really wants,
Starting point is 01:11:44 which is why I think like, even if you have a fully comprehensive omniscient model, putting the model into the context of what the user's trying to do is where a lot of tooling and infrastructure becomes really, really helpful. Anyway, I found our Waymos. You continue to build excellent tooling for vision, and I think
Starting point is 01:12:04 the world is very grateful for that. Let's get to the call to action. I think we've sort of given a good overview, and people obviously should read the paper and try out the playground, try out Roboflow. If they're interested in diving deeper, what is the call to action from
Starting point is 01:12:21 each of you? I mean, try the demo, try the code. We've got a lot of resources on the GitHub repo. It's a very well-managed launch, by the way. Kudos. I don't know. It probably takes a lot of effort just on the launch itself,
Starting point is 01:12:36 even after the model's done. Yeah, and actually just on that, maybe one thing, just a shout-out to the whole team. I think SAM 3 was our biggest and most ambitious project to date. And it really took a huge team of scientists,
Starting point is 01:12:50 engineers, interns, software engineers, you know, across the company. So, you know, really huge shout-out to the entire team that made not just the model successful, but also the demo and then all the launch and everything. So it was a huge team effort. Definitely, like, would love to hear from people on what you're using the models for, where it's failing, you know, raise GitHub issues, messages on Twitter. We'd love to hear from you
Starting point is 01:13:14 on where we should go next as well. Yeah. And on top of that, definitely also try out our benchmark, the SACO benchmark. I would say that it's likely that the benchmark will last longer than our SAM 3 model. Maybe next year there will be a stronger model, but the benchmark is the one that I hope will guide the community to get better and better models. We kind of measured human performance on the benchmark. I think maybe we are the first ones to do that for this kind of segmentation and video grounding task. It's very difficult to measure human performance on this task.
Starting point is 01:13:54 Hopefully, this benchmark guides the community to achieve human performance for this task and even surpass human performance there. We set out to be one of the best places, if not the best place, to build with SAM 3 and the SAM family of models. So we're going to see what people build with SAM and computer vision models to move the whole field forward. We have infrastructure for everything from deploying SAM 3 zero-shot to making your own fine-tunes, to automating labeling of data with SAM. And we continue to see the impact with each subsequent release expand the number of use cases and the amount of use and accelerate the time to value.
Starting point is 01:14:28 So excited to see what folks can build on Roboflow with SAM. Thank you all so much. This is a really great company. It's great work. And just obviously always expands my mind as to what is possible with machine learning. Yeah, I mean, we're not at ASI yet or AGI yet. But every day we're getting closer. Awesome.
Starting point is 01:14:48 Thank you so much. Thank you. Thank you. Thank you.
