Latent Space: The AI Engineer Podcast - SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)
Episode Date: December 18, 2025

As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)

From SAM 1’s 11-million-image data engine to SAM 2’s memory-based video tracking, MSL’s Segment Anything project has redefined what’s possible in computer vision. Now SAM 3 takes the next leap: concept segmentation, which means prompting with natural language like “yellow school bus” or “tablecloth” to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio, SAM can now even segment audio output!

We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned on Llama, the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k, how SAM 3 separates recognition from localization with a presence token, why decoupling the detector and tracker was critical to preserve object identity in video, how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini, and the real-world impact: 106 million smart polygons created on Roboflow saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.

We discuss:
* What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like “purple umbrella” or “watering can”
* How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly
* Real-time performance: 30ms per image (100 detected objects on H200), 10 objects on 2×H200 video, 28 on 4×, 64 on 8×, with parallel inference and “fast mode” tracking
* The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity
* The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks), fine-tuned on Llama 3.2
* Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses, automating the hardest part of segmentation at scale
* Architecture innovations: presence token to separate recognition (“is it in the image?”) from localization (“where is it?”), decoupled detector and tracker to preserve identity-agnostic detection vs. identity-preserving tracking
* Building on Meta’s ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2’s memory-based tracking backbone
* SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like “find the bigger character” or “what distinguishes male from female in this image”
* Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples
* Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more

MSL FAIR team
* Nikhila: https://www.linkedin.com/in/nikhilaravi/
* Pengchuan: https://pzzhang.github.io/pzzhang/

Joseph Nelson
* X: https://x.com/josephofiowa
* LinkedIn: https://www.linkedin.com/in/josephofiowa/

Timestamps
00:00:00 Introduction and the SAM Series Legacy
00:00:53 SAM 3 Launch: Three Models in One Release
00:05:30 Live Demo: Concept Prompting and Visual Exemplars
00:10:54 From Prototype to Production: The Evolution of Text Prompting
00:15:45 The Data Engine: Automating Exhaustive Annotation
00:14:10 Real-World Impact: 130 Years of Humanity Saved
00:25:11 Architecture Deep Dive: Decoupled Detection and Tracking
00:28:02 SAM 3 Agent: Bridging Vision and Language Models
00:33:20 Head-to-Head: SAM 3 vs Gemini and Florence
00:47:50 Video Understanding and the Masklet Detection Score
00:20:24 Fine-Tuning and Domain Adaptation: From Waymos to Medical Imaging
00:52:25 The Future of Perception: Native Vision vs Tool Calls
01:05:45 Building with SAM 3: Roboflow's Rapid Auto-Labeling
00:57:02 Open Source Philosophy and the Path to AGI
00:58:24 What's Next: SAM 4, Video Scale, and Beyond Human Performance

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Okay, we're here in the remote studio with the grand return of the Roboflow and Latent Space and SAM combo.
Welcome to Joseph, my sort of Vision co-host, I guess.
Thanks.
Great to be here.
Welcome back.
We also welcome back Nikhila Ravi, who's the lead on SAM 2.
I guess just SAM in general, right?
And joining us we have Pengchuan, who's also a researcher on SAM.
Yeah, nice to meet you guys.
So congrats on SAM 3's launch.
I mean, the demo, each time
you set it up, is really amazing.
And I think, every time, my general impression or takeaway when I tell people about SAM
is that every time you have a new release, it's like, once a year you show
up, you drop a banger, and then you just drop the mic
and go for next year.
And you also add a dimension.
So I was, weirdly, entirely not surprised when SAM 3 had the 3D thing.
Because I'm like, well, yeah, which is the next dimension to go?
It's 3D.
Actually, maybe just on that, I think that's actually a common misconception.
We launched three separate models this time.
It was SAM 3, SAM 3D objects, and SAM 3D body.
Yes.
Those were two completely separate models, and SAM 3 is just the image and video understanding model.
Which is on a DETR backbone and is sped up.
Yeah, sorry, I didn't mean to preface all this.
But maybe just to remind our audience, or for people new to the SAM series of podcasts that we've done so far,
maybe each of you can go around and intro your entry into computer vision or your relationship with SAM.
Go ahead, Nikki.
Okay, cool.
Hi, everyone.
I'm Nikhila.
I'm a researcher at Meta.
I've been at Meta for eight and a half years.
So I've really been through the evolution of the field in that time.
I started out working on a range of different problems in computer vision, and worked briefly on 3D.
We built this library called PyTorch3D.
But I really started on Segment Anything as a project around late 2021.
So it's actually been almost four years that I've been working in this Segment Anything space.
And, you know, we started with SAM 1 in 2023, SAM 2 last year in July 2024,
and then now SAM 3.
So it's been a culmination of a lot of work by a lot of people over the years.
So yeah, really, really excited to be at this point and get to share it with all of you.
I'll hand it over to Pengchuan.
Yeah.
Hello, everyone.
So I'm Pengchuan.
I'm a researcher on the SAM team.
I have been working in the computer vision field for nearly nine years, starting from 2017.
I think it's a long time.
I worked at MSR for five years and then moved to Meta Reality Labs to work on
egocentric foundation models on AI glasses for a while.
And then in 2020, I moved to the SAM team, and that time was exactly the start of SAM 3.
And I think that's the experience of a lifetime that I have had on the SAM 3 team.
And I'm glad that SAM 3 is out and I have kind of achieved my original grand goal in
computer vision: to reach kind of human performance in detection, segmentation, and tracking in images
and videos.
I'm Joseph, co-founder and CEO at Roboflow, where our mission is to make the world programmable.
We think software should have the sense of sight, and models like SAM and others are critical
to unlocking that capability.
Now, millions of developers and half the Fortune 100 build with Roboflow's tools and infrastructure
to create and deploy models to production.
We've been big believers in the Meta family of open source models, all the way
back to Mask R-CNN and Detectron2, all the way to the present with SAM 1, SAM 2, and SAM 3.
The work that the Meta team does to advance the state of the art in open-source computer vision
has been bedrock to enabling developers and enterprises globally to adopt AI.
So we've been big fans of the work, and I'm pleased to be joining you today, swyx,
to co-host the episode on SAM 3.
And you guys shipped your own DETR model, too.
Yeah, we've been doing some work to
advance machine learning research too.
For example, our work on
DETR, detection transformers, was born out of
NeurIPS last year. I think, swyx, you actually challenged
us. You were like, hey, what are some of the advancements
that are happening in computer vision and
in visual AI? And we had this observation
that transformers had surpassed
a lot of CNNs in vision
tasks, but they hadn't been made to run
in real time, as in
over 30 frames per
second, for example, on a small
T4, or excuse me, on a small edge device,
and hundreds of frames per second on a T4.
So we did some research and published RF-DETR,
Roboflow Detection Transformer,
which is, you know,
we kind of joke, the greatest-of-all-time model
for doing real-time segmentation
and object detection on the edge.
Now, in RF-DETR,
you have to have a fixed class list
and need to know the objects
that you want to segment ahead of time.
But for anyone that's running on
constrained compute and on an edge device
and wants an Apache 2.0 model to do that,
RF-DETR and its family of models
are key to fulfilling that mission and that goal.
Yeah, amazing. Okay, I think we are going to just go into a SAM 3 demo. I think Nikki, you've prepped some stuff to show us.
And this is great because obviously there's nothing better than the creator of the tool showing off the tool.
So just to start with, what is SAM 3? SAM 3 is a model that can detect, segment, and track objects in images and videos using what we call concept prompts.
So I'm going to start with a simple image example,
and then we'll show you a video example.
So a concept can be anything that is a short text phrase.
So here, for example, we can use something like watering can,
and you can see the model predicts a mask for the watering can.
You can also then refine the prompts using clicks or additional visual exemplars,
which I'll show you in a different image.
But essentially, the idea of a concept prompt opens up the ability to find all instances
of an object category without having to manually click on every single instance,
as you would have had to do if you were using SAM 2 or SAM 1.
Now, if the model misses any of the instances, you can add visual exemplars.
So a visual exemplar is also a way to describe a concept to the model.
So here I can add a positive box and show the model that this is
also an instance of a flower that we want to detect.
So this is just on images, but what's really cool is you can now also do this in video.
And so here I'll show you an example.
Maybe this is a football match.
You want to track all the players in white, for example.
So red jersey or white jersey, you can provide a concept prompt.
And the model will find the objects in the first frame and then track them and
detect the new instances that appear later on in the video.
So it's not just detecting on the first frame,
but both tracking those detections and finding new instances
that appear throughout the video.
And one of the things we love to do in our demos is also show
some real-world applications of this,
and so one idea here is that you can use this for video editing or adding effects.
So here it was a really simple mask effect,
but you can imagine, for example,
you might want to add a trail around the players.
You know, you can follow them around.
Maybe you want to clone them.
So you've got multiple players running around.
You can also do background effects,
for example, spotlighting players.
And so these are just fun things you can do on top of the SAM 3 outputs.
And this is just a way to show people what you can do.
There are also some templates, which are basically pre-populated
with a text prompt and an effect.
And these are just some fun ways you can use the outputs,
but really, you know, the crux of it is in this, like,
create from scratch where you can upload any image or video and try Sam 3 on that.
And we'll share the link so you can try it out as well.
One of the other demos that I have is a busy scene for doing labeling,
which we can do later on, but just to give you a preview:
if you want to find tablecloth, and maybe back there
there are airplanes,
so I'll do airplane,
and you kind of get the ability to start to...
Do you find the confidence thresholds?
They do.
I don't know why tablecloth wasn't as good.
I've used that one in the past.
Table, maybe?
Yeah, cool.
Wow, look at that.
Yeah.
I think the other impressive thing that you guys
emphasized in your launch is the latency.
I don't know where this particular inference is running,
but it says something like,
SAM 3 runs in 30 milliseconds on a single image
if I want 100 detected objects on an H200.
Obviously this isn't an H200,
but it's also just impressively fast.
And basically you can be real time if you want.
Yeah, definitely on images.
On images it's really fast.
And then on video it kind of scales with the number of objects.
But for a limited number of objects,
it's still real time.
Yeah.
I'd also add that even for video,
if you can afford the GPUs,
there is a kind of parallel inference algorithm.
So even if you have a lot of objects to track,
you can still get real-time tracking performance
as long as you scale up the GPUs there.
So I'm reading in the paper,
it's 10 objects on 2 H200s, 28 on 4 H200s, and 64 on 8 H200s,
something like that.
I don't think there's an architecture.
I don't know if this is the parallelism demonstration that we're talking about.
Yeah, in fact, when you
try the demo, the video
uses the parallel implementation of the video tracking.
So it's already in that fast mode.
Yeah, if you try it with a video with lots of objects,
you'll notice that it's actually not very slow,
and you get the sense that we are doing the multi-GPU inference.
Yeah, everyone should try it out and see for themselves.
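The real-time object counts read out from the paper in this exchange (10 objects on 2 GPUs, 28 on 4, 64 on 8) can be turned into a per-GPU figure with a little arithmetic. The following is only a back-of-the-envelope sketch: the names `realtime_objects` and `objects_per_gpu` are made up for illustration, and the numbers are simply the ones quoted in the conversation.

```python
# Real-time video tracking capacity as quoted in the conversation:
# number of GPUs -> number of objects tracked in real time.
realtime_objects = {2: 10, 4: 28, 8: 64}


def objects_per_gpu(capacity):
    """Return the number of tracked objects per GPU for each GPU count."""
    return {gpus: objs / gpus for gpus, objs in capacity.items()}


per_gpu = objects_per_gpu(realtime_objects)
for gpus, ratio in sorted(per_gpu.items()):
    print(f"{gpus} GPUs: {realtime_objects[gpus]} objects, {ratio:.1f} per GPU")
```

On these quoted figures the per-GPU capacity does not fall as GPUs are added, which fits the point made above that tracking many objects can be parallelized across devices.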
So, okay, amazing.
So this thing about concept segmentation, I feel like you had a prototypical version of this.
And in your paper, you really talk about like sort of generalizing it.
I guess like, what was the planning like in Sam 3?
At the start of this, is what we have today exactly what you planned for?
Or did you kind of, did it emerge as you discover capabilities?
Maybe I could quickly talk about, yeah.
In Sam 1, we did have a proof of concept of text prompting.
But that was just a very early exploration.
It wasn't really built out, and, you know,
it became the most highly requested feature since then.
And so we, you know, in Sam 3,
we really wanted to do it properly
and actually do this in a way that it works
in all different scenarios.
And so we had to really think about how to formulate the problem.
So it could have been that we took open-ended text input
and made it work for all open-ended text,
or we could have been more focused,
which is what we chose to do,
and really focus on these atomic visual concepts like yellow school bus or a purple umbrella
and really focus on nailing the problem for these like atomic visual concepts.
But Pengchuan, maybe you want to talk a little bit about the benchmarks that existed
previously and how we had to actually fully redefine the task and the benchmark that we wanted to
solve.
Yeah, and maybe just to add to Pengchuan's point, if you look at the size of these
benchmarks: the previous benchmark that Pengchuan mentioned, LVIS, that everyone uses, has about
1.2K unique concepts, and the benchmark that we created, which we're calling Segment Anything with
Concepts, or SACO for short, has more than 200,000 unique concepts. If you think about
the natural language that people use, we don't just use a thousand words. We have a very
large vocabulary, and we really wanted to build a benchmark that can capture that diversity in
size.
Yeah, it's really impressive and also very formulaic, I guess, or classic, that
every great model starts with a lot of data work. I think it's basically, you know,
the scaled-up version of the same process as for SAM 2.
Yeah, in some ways. I think in SAM 3 the
data engine really was a very novel and critical component.
I think, you know, to your point,
a decisive advantage in AI is not just about the models,
but really about the data,
and maybe even more so the data engine
to generate that data.
And we put a lot of effort in SAM 3 specifically
into trying to automate that process a lot.
One of the things that we're really impressed by
is the diversity and depth,
as well as breadth of uses that we see with models like Sam in production.
Basically, when you think about computer vision,
you know, folks kind of always, classically,
they think about like dogs and cats and simple sorts of things.
And the reality is, like, computer vision is where AI kind of meets the real world.
So any sort of thing that needs to be seen and understood,
you need to have understanding of that thing.
So a model like Sam,
expanding the concepts from like, you know,
a few thousand closed-form concepts max in a single model
to tens of thousands of concepts means that you're going to see
such a huge acceleration of the number of fields and applications of the model.
So this is SAM3, right?
So we've already seen and measured some of the impact of the SAM family of models.
And we pulled some of the updated stats on how impactful SAM is being across the Roboflow community.
I think Roboflow might maintain one of the largest, if not the largest, hosted instances of SAM.
And we've seen basically 106 million Smart Polygon-created examples that are SAM 1, 2,
or 3 powered. And we estimate that that saved humanity collectively 100, maybe 130 years
of time just curating data, depending on exactly how you want to do the calculation. And each of those
use cases, right, isn't dogs and cats on the internet. It's things like, I don't know, we see
medical labs across the world that are accelerating cancer research by doing things like counting
and identifying the automation of neutrophils after a given experiment. Or we see folks that are
using aerial imagery for things like helping a drone navigate through the world, or maybe counting
and seeing solar panels from above, or maybe even doing insurance estimates.
We see folks that are building underwater trash cleanup robots.
So you can imagine an autonomous underwater bot that's navigating through the Pacific
Ocean, identifying and grabbing plastics, and cleaning up the world's ecosystem.
Relatedly, we've seen some work with aquariums across the U.S., like MBARI, who are doing work on
keeping track of species and identifying the impact of given steps that are taken on
increasing the populations of given fish with underwater fish cameras. We see folks in industrial
settings doing work to produce electric vehicles or get products from point A to point B. At the time
I'm recording this, it's near Christmas time, and it's high time for the holidays for folks that
are doing gift giving. And that ends up being really, really high time for making sure goods and services
show up where they're supposed to be at the given point in time. One of the statistics
that we track is the frequency with which folks cite works like SAM or Roboflow or blogs that we publish.
And there's now basically a little over two research papers published every day
citing some of the work across the Roboflow community.
And that's folks that are publishing in Nature and ScienceDirect and a number of fairly prestigious journals.
And each of those, you got to think about it.
Each one of those publications is someone's like seminal work, often six, 12, 24 months of effort that's been
accelerated from models like Sam. So it's not an exaggeration to say, like, models like Sam are
speeding up the rate at which we, you know, solve global hunger or find cures to cancer or make sure
critical medical products make their way to people all across the planet. And at the infrastructure
level, we're, like, thrilled and surprised constantly by the breadth and depth of adoption that we see
from the community. I mean, in the first five days of Sam 3, there was like eight million
inferences of folks that were running across all diverse sets of fields. And that's actually only
increased because it was released and then there's like Thanksgiving and now it's back and folks
are hitting it pretty hard. So it's been incredibly encouraging to see both the depth of adoption
and how much the community takes and uses and relies on models like SAM in prod.
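The labeling estimate quoted in this answer (about 106 million SAM-powered polygons and 100 to 130 years saved) implies a time saved per polygon, which is easy to sanity-check. This is a rough sketch only: the function name is hypothetical, and the Julian-year conversion is an assumption, since the conversation does not say how the estimate was calculated.

```python
# Assumption: one year = 365.25 days; the speaker does not state the convention used.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_saved_per_polygon(years_saved, polygons):
    """Average labeling time saved per polygon, in seconds."""
    return years_saved * SECONDS_PER_YEAR / polygons


polygons = 106_000_000  # figure quoted in the conversation
low = seconds_saved_per_polygon(100, polygons)
high = seconds_saved_per_polygon(130, polygons)
print(f"{low:.0f} to {high:.0f} seconds saved per polygon")
```

Under these assumptions the estimate works out to roughly 30 to 39 seconds saved per polygon.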
Yeah. And maybe just to add to that from Meta's side, we don't usually get
as much visibility into all of these real-world use cases. So, you know, being able to
hear that from Roboflow and having these models available on the platform is so valuable
for us. It's also, you know, how we get to know how these models actually work in the real world,
which is, you know, ultimately the best eval for a model. So I think, you know, it's definitely
awesome to hear about all these things that we're empowering.
Nikhila, you had this
comment that the best eval for a model is not necessarily a benchmark. What was it?
It's if it works on real-world things. I think it's a really good soundbite,
something like the best eval is if it works in the real world.
Yeah, true.
And that's like the ultimate goal for all of our models, like Sam 1, Sam 2, Sam 3,
we want people to use it out of the box as much as possible.
And I think, you know, with language in Sam 3 specifically,
there does need to be, in some cases, some domain adaptation.
But we have sort of tried to make that easy.
I don't know, Pengchuan, you want to talk a little bit about that,
like the fine-tuning aspect?
I wanted to also endorse
like the real-world thing.
I was just so happily surprised
when I was visiting the CZI
Imaging Institute
in preparation for our pod with Mark
that they were using Sam
in imaging the human cell.
And they showed us like how
in reality all these sort of masses
are actually like really undifferentiated
and it's really hard for the human eye to track.
This is actually a simpler one;
this is pretty clean here.
In reality, a lot of it is just gray mush.
And you have to segment individual lysosomes out of these.
And they showed us how they were using Sam and fine-tuning Sam to do it.
Yeah, really, really, really complicated and also like very meaningful, right, for, for basic science research.
And I'd also maybe mention this in the paper, the distribution: you can actually see what SACO covers.
So a lot of animals, a lot of animals.
And then very surprisingly few maps.
I'm like, maybe there should be more maps.
I'll say Hugging Face has been doing a lot here, and other companies.
Yeah, this is actually one thing.
Something we get asked a lot is like, what's the minimum amount of data I need to fine tune?
And, you know, being able to do that with just 10 data points will hopefully unlock a lot more than we can do ourselves.
Yeah, I mean, the more the merrier, obviously.
This is where ablations are really helpful.
You probably didn't have any fine-tuning ablations in here.
I think this is all data and model training oriented.
But yeah, I mean, like very very clear.
I just have a cheeky, curious point.
What is the ratio of negative examples to positive examples, right?
So in Nikhila's example, when you were demoing just now, you only selected positive examples.
Obviously, there are going to be a lot more negative examples of not-class than positive examples of class.
So should there be some exchange ratio where negative examples contribute less than a positive
example, or is that not the case?
For positive and negative examples, I don't know that I have seen a golden ratio
that works well or doesn't work well, but I can offer anecdotally that a single negative
example goes a long way.
A common place where fine-tuning is really helpful is data that's out of distribution
but that might plausibly have been in distribution.
Like one of my favorite fine-tuning examples is like counting Waymos.
There's not that much data that has Waymos labeled throughout the streets of San Francisco,
but SAM does a really good job of identifying a Waymo as a vehicle.
If you prompt with Waymo, it doesn't find anything; if you prompt with vehicle,
it labels a Waymo as a vehicle, which is valid, but a Waymo is a specific type of vehicle, right?
Usually, from even just like a 10-second video clip,
you can actually start to have Sam 3 learn what should have been seen as a Waymo
versus what should have been seen as a vehicle.
And even on a single image example,
we see that like Sam 3 starts to adapt
because it takes the text and image prompt into account
when it makes a subsequent inference.
From like three to five negative examples
alongside positive examples,
you start to see the model update its priors, if you will,
for where it would predict things from what the user provided.
All this is riddled with caveats, right?
Because when you talk about the visual world,
the negative examples and the positive examples
could have been from a very different perspective
or of a very different type of object.
Like maybe you're labeling dog breeds and suddenly a new dog breed appears, or maybe you have a
perspective where it's overhead and then suddenly you have a side-by-side view.
So usually the best way is to have these things meet the real-world data and try.
But I'll offer maybe the note that a small number of negative examples goes a really long way,
small like three to five, not like hundreds.
Yeah, the other place where negatives play a big role is just: is it in the image or not?
and that was one of the things that we did was really separate the problem into a recognition
problem and a localization problem. So first, can you answer the question, is this object or
is this concept in the image? And then if it's in the image, where is it in the image? And so to really
to really build in that capability, we had to annotate a lot of negative phrases in images. So basically a lot of
phrases that don't exist in the image in addition to the concepts that exist in the image with
the corresponding mask pair. So we have, you know, if you look at one of the tables in the paper,
which shows the training dataset distribution, I think it's table 24. We have about 70, more than 70%
of the annotations are these like negative phrases that are not present in the image. So we have to
really train the model to not detect stuff that is not in the image.
Yeah, I think that the separation of localization and,
it's basically precision recall, right, but in the vision domain.
We basically add this presence token to the model,
which explicitly separates the task of recognition and localization.
So basically it simplifies the task.
And so the model doesn't have to try to do everything with just the proposals in the detector;
it's able to have this global, sort of learned token just for the recognition part.
Yeah.
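The split between recognition ("is the concept in the image?") and localization ("where is it?") described here can be sketched as one global presence score that gates every per-proposal score. This is a minimal illustration, assuming the two are combined by multiplying probabilities; the conversation does not spell out the combination rule, and `score_proposals`, the logits, and the threshold are all made up for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_proposals(presence_logit, proposal_logits, threshold=0.5):
    """Gate per-proposal scores with a single image-level presence score.

    Assumption for this sketch: final score = P(present) * P(proposal matches).
    """
    presence = sigmoid(presence_logit)  # recognition: is the concept in the image?
    scores = [presence * sigmoid(logit) for logit in proposal_logits]  # localization
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return presence, scores, keep


# Concept judged absent: even confident-looking proposals are suppressed.
_, _, kept_absent = score_proposals(-4.0, [3.0, 2.5])
# Concept judged present: the strong proposals survive, the weak one is dropped.
_, _, kept_present = score_proposals(4.0, [3.0, 2.5, -2.0])
print(kept_absent, kept_present)
```

The design point this illustrates is that a phrase with no match in the image can be rejected once, globally, instead of relying on every individual proposal to score low on its own.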
In general, I find that you guys did a lot of extra net new work.
You had a really nice chart in here with the yellow boxes being the new stuff.
I forget where.
Yeah, the architecture diagram.
Yeah.
I'm like, holy crap.
Last time it was like,
you know, there was the memory stuff.
This is SAM 2.
And here it is, all this.
Obviously, you know,
it's hard to cover it all, but
you know, I wonder if there's any other
interesting stories or tricks
like the presence
token that you might want to focus on.
Yeah, I mean,
this is nice.
this diagram, I'm glad you brought it up, because Sam 3 isn't just a version bump.
It's, you know, an entirely new approach to do segmentation.
It's like this new interface for segmentation, and it combines so many different tasks
where previously you would have needed a task-specific model for each of these tasks.
You know, interactive segmentation, text prompting, open vocabulary detection, tracking,
like all of these tasks you would have needed a separate model.
So we really had to do a lot of work to bring it together.
I think one of the things we did was really decouple the detection component and the tracking
components.
So you can see, you know, we still preserve the tracking components from SAM2.
But the detector is separate.
And the reason we do this is, if you think about what a detector has to do and what the tracker
has to do, the detector needs to be identity agnostic.
So if you have a concept dog, it needs to be able to find all instances of that dog,
and it needs to sort of have this representation of dog that is the same for all dogs.
But when you're tracking those dogs through the video,
each dog needs to have a separate representation such that we're able to preserve the identities.
And so there's this kind of task conflict that emerges between the detector and the tracker.
And so, you know, we experimented a lot.
We really tried to build kind of a unified approach to do things,
but then what we found was that having the separate detector and tracker really worked.
But we use the Perception Encoder as this shared visual backbone.
It's sort of a text-and-image-aligned encoder.
You can see the green boxes there; it says "from PE".
That's Perception Encoder.
That was also from
our group in FAIR at the time.
This was released earlier this year in April.
And so this really is bringing together components from the entire FAIR and Meta ecosystem.
We have Perception Encoder.
We have a DETR detector.
We use SAM 2.
We also use Llama in our data engine.
So we really like using all the components from.
Yeah, it's like any third film in the trilogy.
Like you always see like the previous recurring characters come back.
Yeah, well, it'll work.
You've got to continue using it.
And to connect to something we discussed earlier,
you mentioned that in the video component,
each object needs to be tracked independently.
That's why the compute scales linearly with the number of classes, right?
Because each of those instance types needs to be maintained.
It scales with the number of detected objects.
Yeah.
So, for example, like each dog that appears in the video,
each one of those needs to be tracked independently.
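The division of labor discussed here, with an identity-agnostic detector finding every instance and a tracker keeping one state per object, can be illustrated with a toy association step. This is a sketch under stated assumptions, not SAM 3's actual mechanism: it uses boxes instead of masks and a simple greedy IoU match, and `update_tracks` and its parameters are hypothetical.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def update_tracks(tracks, detections, next_id, min_iou=0.5):
    """Greedily match per-frame detections to existing tracks.

    tracks: dict of track id -> last known box (one entry per tracked object).
    Matched detections update a track; unmatched ones start a new identity.
    """
    unmatched = set(tracks)
    for box in detections:
        best = max(unmatched, key=lambda t: iou(tracks[t], box), default=None)
        if best is not None and iou(tracks[best], box) >= min_iou:
            tracks[best] = box          # same object, identity preserved
            unmatched.discard(best)
        else:
            tracks[next(next_id)] = box  # new instance appearing in the video
    return tracks


ids = count(1)
# Frame 1: the detector finds two dogs.
tracks = update_tracks({}, [(0, 0, 10, 10), (20, 0, 30, 10)], ids)
# Frame 2: both dogs moved slightly and a third one appears.
tracks = update_tracks(tracks, [(1, 0, 11, 10), (21, 0, 31, 10), (50, 50, 60, 60)], ids)
print(tracks)
```

Because the `tracks` dictionary holds one entry per object, the work per frame grows with the number of tracked objects, which matches the point above that video cost scales with detected objects.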
There was something else that you started to allude to in the paper
that I was hoping we would spend some time discussing,
and it's interaction of Sam 3 and LLMs, Lama, and others.
So using Sam 3 to almost be like a tool call for LLMs to give them better grounding
and give them better visual understanding.
And there's a table in the paper where you describe the increase in performance.
It's kind of alluding, I think, to maybe where things are going for using Sam 3 as a
component part of multimodal architectures.
Do you want to describe a bit about what the introduction of that work was meaning to showcase
and how the interaction of Sam 3 and LLMs is envisioned to be important.
Yeah, maybe I can just do a quick intro and I'll hand over to Pengchuan to do the deep dive.
But essentially, as I mentioned, Sam 3, we constrain the text input to these atomic visual concepts like yellow school bus or yellow watering can.
But obviously, people want to interact with the model of natural language and we want to enable that as well.
And so that really segues into being able to use SAM3 as this visual agent for an MLLM.
And so I'll hand over to Pengchuan.
Maybe you can explain about the SAM3 agent setup and then talk through some of the results that we got there.
Yeah, yeah.
So as Nikhila mentioned, the big picture is that SAM 3 is focused on this kind of atomic concept.
But people definitely want to try much more complex phrases,
like, okay, locate the bigger character for me.
Or, for example, what is the feature that distinguishes male and female in this picture?
Then these are more complex language.
This is exactly what SAM 3 cannot do, but what the SAM 3 agent targets to solve.
In this case, you can see that it needs much more advanced
language understanding and reasoning.
SAM 3 currently
does not have this kind of
capability because it has a small
language encoder. But
we know that large language models
definitely have it; they
were trained on a lot of data and
have this kind of world knowledge and
the reasoning capability.
The SAM 3 agent
is exactly using
SAM 3 as an eye
for the large language models
to solve this
kind of complex visual grounding tasks.
Is there any sort of insights that you or surprises that you have
other than, I guess, like, Sam is a very good tool?
Is that the main conclusion?
Go to Table 8 in the paper as you described this, if you don't mind.
Table 8.
Okay.
Yeah.
Yeah.
Good to go.
Yeah, please.
Maybe to quickly reply to
swyx's question. I would say that first, besides that SAM 3 is really a good tool
that provides the eye for the large language model, the other thing we definitely found is that
SAM 3 is not perfect. It's not as robust as the human eye. The language
model also helps to correct SAM 3's errors. They have a synergy between each other instead
of just, okay, the large language model provides the brain and SAM 3 provides the eye.
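To make the agent idea concrete, here is a minimal Python sketch of the loop as described in the conversation: a language model turns a complex query into atomic noun phrases, a segmenter returns candidate masks, and the language model then filters those masks against the original query. Every name here (`run_agent`, `propose_phrase`, `segment`, `accept`, the `Mask` type) is hypothetical, and the toy stand-ins replace real model calls; this is not the SAM 3 agent implementation from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Mask:
    label: str    # toy description of what the mask covers
    score: float  # toy confidence


def run_agent(
    query: str,
    propose_phrase: Callable[[str, List[str]], Optional[str]],
    segment: Callable[[str], List[Mask]],
    accept: Callable[[str, Mask], bool],
    max_rounds: int = 3,
) -> List[Mask]:
    """Iterate: propose an atomic phrase, segment, keep masks that fit the query."""
    tried: List[str] = []
    for _ in range(max_rounds):
        phrase = propose_phrase(query, tried)  # the "brain" picks a simple prompt
        if phrase is None:
            break
        tried.append(phrase)
        masks = segment(phrase)                # the "eye" returns candidates
        kept = [m for m in masks if accept(query, m)]  # the "brain" corrects errors
        if kept:
            return kept
    return []


# Toy stand-ins for the language model and the segmenter (assumptions).
def toy_propose(query: str, tried: List[str]) -> Optional[str]:
    remaining = [p for p in ["dog", "large dog"] if p not in tried]
    return remaining[0] if remaining else None


def toy_segment(phrase: str) -> List[Mask]:
    table = {"dog": [Mask("small dog", 0.9), Mask("large dog", 0.8)]}
    return table.get(phrase, [])


def toy_accept(query: str, mask: Mask) -> bool:
    return "large" in mask.label


result = run_agent("locate the bigger dog for me", toy_propose, toy_segment, toy_accept)
print([m.label for m in result])
```

The filtering step is where the synergy mentioned above would show up: the segmenter over-returns, and the language model discards masks that do not match the full query.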
Interestingly, you use Llama 4.
I saw there's a mix of Llama 3 and Llama 4 here.
But it looks like it does best with Gemini 2.5, which makes sense given this comparable set of MLLMs.
I'm just like, I think the baseline also is just, well, what extra addition does this add on top of just the MLLM?
I would maybe want to see that ablation.
Maybe you've already done it somewhere.
What do you mean by additional thing?
So it's basically like without a tool call, there's some native capability.
inside the MLLM itself.
Wow.
In fact, that's a really good question.
In fact, I believe our reviewers even asked that question.
So you can imagine that without
large language models,
without a VLM, with SAM 3 only,
for ReasonSeg,
it only achieved,
on the validation set,
if I remember correctly,
only about 30,
numbers like that.
And also, it's very intuitive.
You can see that ReasonSeg
has different subsets,
short and long.
Short is very close to
SAM 3's training data. It's
atomic phrases, short phrases.
Long is this kind of very complex
reasoning. You will see that
for short, SAM 3 only is very
close to the SAM 3
agent. But for long, the
gap
is so large, which indicates that, okay, that is exactly the capability
that the language model is bringing.
Got it.
I can show an example here that might be insightful too.
Go for it.
So even comparing like Sam 3 and Gemini, let's say that we just want to have them do
like an object detection task here of finding here we're going to prompt with a speedometer
and RPMs, and we're going to ask for things like indicator light, number, and needle.
And if we run Sam 3 head to head with Gemini 3 and Florence 2,
almost as a baseline of where things have been,
and we see each of the results.
First things first, you'll note that the speed of inference of Sam 3 is quite quick.
This is just calling the Gemini 3 Pro API,
so whatever is provided from hosted compute is sort of what you get on the response time.
And then the second thing you'll note is, in addition to speed,
is some of the accuracy of results.
We might get it,
we might have a timeout error, let's see.
Do you have Elo scores?
What scores?
Elo scores, like,
Elo, yeah.
Yeah, you had the arena.
Okay.
I was wondering what the Elo was,
because you said you were blind testing this.
Yeah, that's actually interesting
because we had blind tested SAM3
before it was released, not as SAM3,
just for people to try and compare
I think we called it like a potential SAG or SEG preview or something.
And we allowed users to vote and they kind of unanimously voted for what they didn't
know at the time was SAM 3.
We actually got like emails of people being like, hey, like where can I use that?
And we just sort of ignored them until the model came out.
But so here with the responses, you see that the grounding capabilities of Sam and Sam 3 compared
to even Gemini are out ahead currently.
So not only is it doing grounding,
but if you look closely,
you can actually see it's making segmentation masks too,
whereas Gemini 3 struggles,
it just does detection by comparison.
And then the other thing is just the richness of detections,
like the recall is high as well as the precision.
And if we compare here, it does it almost as well, right?
But you see that it misses some of the numbers
and has kind of these,
some of these erroneous boxes that it's predicted. And then it also doesn't do segmentation.
So it just does detection of the task. So you can envision that the same way the Sam 3 paper
introduces the idea of using Sam 3 in tandem with MLLMs. I would expect that to be the case
pretty soon. And maybe the Google team taking some notes to improve Gemini and other series
of models based on what Sam 3 demonstrates here. So in other words, not only is it faster,
but it seems to be more comprehensive for concept segmentation.
And I think the speed actually is a huge factor for many use cases.
I think even at Meta we're using SAM3 for various different product use cases
and fast inference speed is very critical to enable that.
And so I think that's something that in many cases you don't even need an MLLM for.
It's kind of overkill to use an MLLM for some applications.
The other interesting thing is the Florence 2 results.
And Florence 2 is a little bit older of a model now,
so maybe it's not fair to put up head-to-head with the state of the art.
But it is useful as a way to just see how far we've come.
Because Florence 2, by comparison,
labels the entire region as a single class
without seeing individual detection of numbers and indicator lights and needle.
And not only that,
but it actually runs about three times slower than SAM 3.
So SAM 3, again, is faster, doing a task that the other models are not doing in segmentation,
and more accurate, both in recall and precision of the things that it's intended to find,
which I think really showcases the capabilities of the model.
In fact, I was even a little surprised by this, because this domain
is more OCR-like; recognizing numbers is nearly OCR.
We did not prioritize this domain in our data collection.
It works.
So we know that it roughly works,
but I think I got surprised that it works so well.
That's encouraging,
even a task that wasn't expressly prioritized.
It still does a great job on.
Yeah.
In fact, during our data engine,
we intentionally do not sample OCR-heavy images.
Wow.
On an easier one,
Glass Mug, Sam 3, Gemini 3,
Florence 2. Sam 3 loaded first and has, really impressively, it sees even this glass mug in the
corner, which I think is something Sam 3 does a great job of, is occlusion and partial objects.
Gemini 3 struggles a bit with this one, I think maybe because the opacity of the objects by
comparison. And then Florence 2 does a good job at finding one of the glass mugs.
So again, another type of task that shows the power and feracity of the model.
Yeah, I mean, exhaustivity, like finding every instance is something we heavily prioritized and is really built into the data engine design.
I don't know, Pengchuan, if you want to talk about how we designed the data engine to really scale exhaustivity.
Because if a human were to annotate and verify every single instance, it would take a really long time,
but we put a lot of effort into trying to automate and speed up that process,
such that we could get to the data scale and diversity needed to get to a step change.
Yeah, yeah.
I think definitely, I would say the data engine is, you know,
the critical component that lets us achieve SAM 3's performance now.
So maybe we can go to the data engine picture.
I think we have a kind of illustration there.
Yeah, page five.
Here.
You can see that this is our annotation kind of pipeline.
So we first source the images and generate the noun phrases.
So this is the input of this task.
We source images and generate noun phrases; for example,
Llama generates a caption and we parse the caption to get the noun phrases.
This is the input distribution.
Then we use the SAM 3 model in the loop to generate candidate masks.
Those should be the candidates,
but they're not perfect, especially in the beginning.
Then you can see the next step is verification.
Since SAM 3 gave you these masks,
we first need to do mask verification, to verify whether each mask is good or not.
And then, after we filter out all the bad masks,
there are some good masks left,
and we verify whether these good masks are exhaustive or not,
like your mug example.
So for example,
if the model
does not predict
that partial mug,
then the exhaustivity check
will fail there.
Then, if
the exhaustivity check fails,
we go to the next step.
You can see that in
the pipeline, we
go to this
so-called
human manual correction.
Humans manually
annotate all the missing masks
to make this data point exhaustive.
So you can see that exhaustivity is a very big factor there,
and we place it at the center of this data engine.
But you can see that if we ask human annotators to annotate every mask from scratch,
it will take a lot of time.
I remember each data point in the beginning took more than two minutes to finish.
But if you use the model in the loop, then it's reduced to about 45 seconds.
You can use the model to propose masks and then just have humans annotate the missing masks.
Then it's 45 seconds.
Another very key innovation in this data engine is that we really found that these verification steps,
like verifying whether a mask is good or not, or verifying whether the good masks are exhaustive or not,
can be done by AI, can be done by a multimodal model.
That is a breakthrough. When we
fine-tune, for example, Llama 3.2 with our
human-annotated verification data, we get superhuman performance on these two
verification tasks, and then we do not need humans for these two tasks. This further
reduces our per-data-point annotation time to about 25 seconds. So you
can see that from the original all-human, about two minutes, to finally
25 seconds for one data point, this is
our journey of making our data engine super efficient.
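The pipeline walked through above (model proposes masks, a verifier filters bad masks, an exhaustivity check decides whether a human is needed) can be sketched in a few lines of Python. The function and variable names are hypothetical, the verifiers are toy lambdas standing in for the fine-tuned AI verifiers, and the timings are the rough per-data-point figures quoted in the conversation (about two minutes, 45 seconds, 25 seconds), so the speedups are approximate.

```python
from typing import Callable, Dict, List, Tuple


def annotate_data_point(
    candidates: List[str],
    mask_is_good: Callable[[str], bool],
    is_exhaustive: Callable[[List[str]], bool],
    human_add_missing: Callable[[List[str]], List[str]],
) -> Tuple[List[str], bool]:
    """Return (final masks, whether human correction was needed)."""
    # Step 1: mask verification drops low-quality candidates.
    good = [m for m in candidates if mask_is_good(m)]
    # Step 2: exhaustivity check; only failures are routed to a human.
    if is_exhaustive(good):
        return good, False
    # Step 3: human manual correction adds only the missing masks.
    return good + human_add_missing(good), True


# Toy example: the model finds the full mug but misses the partial one.
final, needed_human = annotate_data_point(
    candidates=["mug_full", "blurry_blob"],
    mask_is_good=lambda m: m != "blurry_blob",
    is_exhaustive=lambda masks: "mug_partial" in masks,
    human_add_missing=lambda masks: ["mug_partial"],
)
print(final, needed_human)

# Approximate seconds per data point at each stage, as quoted in the discussion.
SECONDS_PER_POINT: Dict[str, float] = {
    "all_human": 120.0,
    "model_in_loop": 45.0,
    "ai_verifiers": 25.0,
}
speedups = {
    stage: SECONDS_PER_POINT["all_human"] / secs
    for stage, secs in SECONDS_PER_POINT.items()
}
print({stage: round(x, 1) for stage, x in speedups.items()})
```

Under these figures, model-in-the-loop proposals give roughly a 2.7x speedup over all-human annotation, and adding AI verifiers brings it to roughly 4.8x.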
Did you maintain statistics on how many images were specifically hard?
For example, like we had n many objects that were very difficult occluded,
or we had some number of images where the comprehensive test was really hard,
or did you just bet that by having a large scale,
you would encompass occlusion and exhaustive cases?
In fact, we do maintain this kind of information, which one is hard, which one is easy, because first, in our data engine, when humans annotate, we know exactly which data points were exhaustively annotated by the model and which ones needed human intervention.
In fact, we have that kind of metadata in our data set.
The second one, the more beautiful part, is that we have this exhaustivity AI annotator.
Then, given a new data point,
we can automatically decide whether this is a difficult data point
or an easy data point using this AI annotator.
Yeah, I think the sort of bootstrapping and annotation story
was very strong last time around,
and it's even stronger this time.
What are you going to do when you run out of humans?
Like, you know, next year you're going to have a superhuman
level of everything, right?
Like PCS and PVS.
What then?
I'm not so optimistic about this.
First, indeed,
our current plan for the next project is
this kind of fully automated data engine.
Without humans, that's our dream.
I think that is the perfect thing,
but we still need some useful information.
There's no free lunch.
There's something that
no model can do well,
and we need humans to inject that useful information.
I would say that what we practically can do
is really minimal human intervention.
Humans only do the tasks that the model cannot do,
the most difficult tasks.
So that's the first one,
in terms of the data engine.
The second one is about human performance
on this kind of PCS task.
My feeling is that computer vision,
when we get
to human performance, will enter this RLHF domain of computer vision.
You can see that for language models, before, in the BERT age,
when language models were not at human performance, SFT, really imitation learning,
really did the job and got to very good performance.
But if you only do SFT and the SFT data is annotated by humans,
then your performance is probably bounded by humans.
You cannot get superhuman performance just by this kind of data
engine approach of using human-annotated
data and then fine-tuning on that.
You need to go to this RLHF domain
where humans really just tell which one,
which one is better.
This is exactly the philosophy:
to tell which one is better
is easier than
to construct the data point from scratch.
So you can get higher performance,
better performance than
what humans draw from scratch.
I would say,
I hope that
after SAM 3, we can see new research emerge in computer vision,
which is, okay, how we go beyond human performance.
SAM 3 is close to that, but I would say that a new learning paradigm is needed to go beyond
human performance for SAM 3 tasks and for computer vision.
Yeah, now, just to add to that, Pengchuan is only talking about images.
I think video is a whole other challenging beast and getting to that really,
automated data engine is something that we tried to do in SAM2.
We actually didn't get to that fully automated approach.
In SAM 1, we did.
We fully, the SA-1B data set that we released was fully annotated automatically.
We didn't really get to that in SAM 2 for video.
And in SAM 3 for video, I think there's still like a lot of room to push on this sort of
pseudo labeling for video and really be able to get to that same step change as we had on images.
What are the biggest changes to see the same step change in video that you've seen in images for automated data pipeline?
Yeah, yeah.
I would say a really good video large language model, a video multimodal model.
When we did SAM 3, earlier this year or last year, you could see that image
multimodal models were very good, but video multimodal models, I think, really only became good or practical
later this year; like Qwen 3, this kind of model
gets roughly okay at this stage.
So now we have a good base model to fine-tune on our data
and to get human performance for this recognition or verification task.
I would say that we definitely need
SAM 3-like efforts on the perception side,
but we also need this kind of multimodal
large language model effort, a good foundation model on the vision-language
side. I think it's ready. It's ready now.
Yeah, also video annotation is just so much more time intensive to get to that,
to be able to annotate enough data to train a verifier, like video mask annotation.
We just found it was like very time intensive. So maybe there are more efficient video
annotation strategies. I think there's, you know, a lot of exploration that could be done
there too. Yeah, you know, spending a bit of time on video. I wanted to also talk about, you know,
Obviously, last time we were focused a lot on memory attention.
I think this time there was this sort of masklet thing that I wanted to just get more ideas on,
or just have you share the idea generally.
What was it called?
The masklet detection?
Masklet detection score, exactly.
And how it's basically smoothing within a temporal window,
which I think basically, you know, a lot of computer vision models don't have this,
and they could just simply add it
and it would be a lot more stable
when it comes to video.
And I don't know why they don't do it.
Maybe I can comment on this.
First, why they didn't do that?
I think one big reason is
this streaming requirement.
You can see when you want to gather information
across the entire masklet,
then you need to wait for the masklet
and then apply this strategy.
So that will sacrifice some streaming
capability. So you can see that the streaming requirement somehow
limits traditional methods from doing this. But I would say that this is definitely
beneficial. The reason why is that I think even humans do this. You can imagine that
when something just appears at the corner of the video, like a hand appears at the corner
of the window, of the video, you just do not know whether this is a man or a woman.
So humans might even make mistakes.
Also, SAM 3 will make this mistake.
But when you get more and more information,
when the person really enters the video fully,
then you get to know, okay, whether this is a man or a woman.
So this idea of
gathering more information to really nail
whether this concept is the concept you queried
is the idea here.
So there is a trade-off between
the latency and accuracy here.
If you care more about accuracy,
then you can use this overall information
across the masklet
to get a more robust signal about the concept.
But if you care about latency,
then you need to make a decision in the very beginning,
and then you will sacrifice some accuracy.
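The latency versus accuracy trade-off described here can be illustrated with a small Python sketch. It assumes a simple mean of per-frame detection scores over a window and a fixed threshold; the function name, the averaging rule, and the numbers are illustrative assumptions, not the actual masklet detection score used in SAM 3.

```python
from typing import List, Tuple


def masklet_decision(
    frame_scores: List[float], window: int, threshold: float = 0.5
) -> Tuple[int, bool]:
    """Decide whether a tracked object matches the concept.

    Returns (index of the frame at which the decision is made, keep?).
    A larger window waits for more frames (higher latency) before deciding.
    """
    if not frame_scores or window < 1:
        raise ValueError("need at least one score and a window of at least 1")
    n = min(window, len(frame_scores))
    mean_score = sum(frame_scores[:n]) / n  # smooth over the first n frames
    return n - 1, mean_score >= threshold


# Toy scores: the object enters at the corner of the frame, so early frames
# are ambiguous and the per-frame score rises as more of it becomes visible.
scores = [0.2, 0.4, 0.8, 0.9, 0.9]

streaming = masklet_decision(scores, window=1)  # decide immediately
smoothed = masklet_decision(scores, window=5)   # wait for five frames
print(streaming, smoothed)
```

In this toy case the immediate decision rejects the object on frame 0, while waiting for five frames keeps it, at the cost of four extra frames of delay.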
I think also in many video use cases,
like you were sharing on Roboflow,
users care more about detecting the objects rather than having unique identities.
So in some cases, this isn't required to preserve the identities throughout the video and you just want to essentially do detection per frame.
Like for the Roboflow Rapid examples you were sharing.
Yeah, there's cases where being able to count and you know the objects are all going to be the same,
so you don't care as much about unique classes. You just want to know
the full presence, things like that matter.
But then there's other cases like you mentioned where, I don't know, like in sport,
you care about individual players versus just knowing that there's 11 players on the pitch.
One thing that might be useful actually to discuss with some of our time is we talked a little bit
about how Sam 3 and MLLMs will play nicely together.
But there's probably like a greater discussion about how Sam 3 fits into a broader
AI ecosystem and like what bigger picture trends it might fit into.
Do you have some thoughts on what
this represents about where things are headed?
Maybe I could say one point and then Pengchuan, feel free to add.
As we mentioned before,
Sam 3 isn't just a version bump.
We are really having a unified model
that can do many different tasks
in the same unified architecture.
And so, you know,
in the same way that LLMs can do many different tasks
without needing a task-specific model,
like with Sam 3,
We're able to do image, promptable concept segmentation, video,
promptable concept segmentation.
We don't need a specialist model for counting.
We can do interactivity.
There really is like multi-capability visual models that are on par or better than the
single task state-of-the-art models.
So that's really one place in which SAM3 fits into the AI ecosystem.
In terms of MLLMs, I don't know if Pengchuan, you want to talk about
the agent approach?
Yeah, yeah, definitely.
I would say that SAM 3 now really gets a big step change in vision.
How it really fits into the general AGI or frontier model landscape
is very, very exciting for me.
We always have this example:
give this six-finger
hand picture
and ask how many fingers do we have in this picture.
You can imagine that with SAM 3,
we can first detect how many fingers
we have there very robustly, six fingers,
and then the multimodal model should know that,
okay, this is a six-finger hand instead of five-finger.
You can see that the errors made by frontier models
can be solved if we use SAM 3 as a tool.
But then, is SAM 3 as a tool the end of
the picture, or should SAM 3 even just be
natively embedded into these frontier models,
so the frontier models have the SAM 3 capability natively by themselves?
I would say that there are a lot of possibilities there.
My picture is that now we have a very good brain with these frontier models.
And we have a very good eye with SAM 3.
Now let's see whether the eye is really working
natively together with the brain,
or is really a different organ that needs to somehow, like a tool, work with the brain.
I think this is a very exciting research area.
And so in your analogy, if you think about like the visual cortex compared to like a human, human brain,
like, you know, we have rods and cones in our eyes that do kind of very fast.
We joke like lizard brain level detection, simple stuff.
And then you have your brain that reasons about some of the visual information that your eyes see.
In your example of SAM3 as a tool call or SAM3 as natively a part of the multimodal models,
which future do you think is more likely?
I think I want to bet on them working natively together.
That's the future, I would say, for simple or even intermediate-difficulty vision tasks.
For example, counting with fewer than 20 objects.
I think this kind of simple task is like system one
visual reasoning with our brain.
This should be in the brain,
and it should do it by itself.
But with very, very difficult tasks,
you can see that if we are counting
maybe thousands of objects in a picture that is so crowded,
then we even need to draw something there.
I would say that at that time,
maybe we need some expert model for difficult tasks.
You can see that this is a
hybrid approach, but I'm more excited that, I think, most of the cases should be native.
The reason why is, I would say, perception or grounding,
to really know where it is, how many there are, is like a fundamental capability
of our brain. I'm just not happy that the frontier model just cannot count how many fingers
immediately, and instead needs to call a tool to do that. I think this
should be a system one thing, and this should be natively in our brain. And also,
if our brain cannot do this task, it means that it's definitely missing some
very critical visual capability by itself. So I would say that
the intuition just feels that it's not correct to not have this capability by itself.
So for very simple system one questions, things like how many fingers on a hand, that should be native.
maybe more complex things that are maybe long-running tasks and long-running reasoning,
then maybe there's a bit more of like a tool-call approach.
Yeah, yeah.
Exactly.
For example, you can see that in our SAM 3 agent or in our AI
annotator, we even demonstrate this approach.
For simple cases, the model can do it by itself: okay, I can detect, for example, 10 people
here.
And then the large language model, the AI annotator, can even know that, okay,
these 10 people are not
exhaustive. Okay, there are more people there. So if you want to do well, then maybe
you need to do more steps, for example, to call an expert model. So you can see that this
is a very, very native reasoning process for more advanced or complicated
vision questions.
I have a related but maybe slightly different question. SAM 3 is an incredibly
powerful piece of work. And it's open source as a part of now MSL.
Is open source critical to achieving AGI?
Maybe I can comment on SAM specifically,
but in SAM 3, we did leverage many of the open source contributions
people have made on top of SAM2.
There were new data sets, there were new benchmarks,
there were new kind of inference time optimizations.
We adopt a lot of the things that the community built
on top of the models, on top of the data sets.
And so all those contributions helped make Sam 3.
For Sam series, we've really benefited a lot from, you know,
being very generous with what we open source and then leveraging what the community builds
on top of that.
But that's just from the Sam perspective.
I think it's clear what the community brings and offers.
And I think, you know, every time we do this, we always shout out to the community to, like,
you know, try it on their use cases and record, like, weird.
findings and like, you know, if it doesn't do what you are trying to make it do, well, let's talk
about it, right? And then maybe sort of implement it in the next version. Like you already said,
Pengchuan, you already hinted at that, like what might be coming for SAM 4, which is at least
a little bit more of the document and OCR work. Any other directions that are interesting? I guess
obviously a lot more video work as well. What is the talk of the town in like the CV community
that like, you know, it would be really great or like super obvious. Like next year is going to be
the year of what?
Yeah. Maybe I
can first say something and then you can add. First, definitely, I think, even if
it's not SAM 4, it's SAM 3-point-something: small models.
SAM 3 currently only has really one model size, so
more efficient models that fit edge use cases, and also a more
efficient model for video. I think currently the video model is not efficient;
you can achieve very good throughput, but you need GPUs to do that.
So first, small and efficient models, that's one big thing.
The second big thing is definitely video.
Roboflow can do that for you.
Yeah.
The second thing is video.
I would say that video is still far from, I would say has a big gap from, human performance.
Right now there's still a lot of research that needs to be done there:
how to do end-to-end training with video. We have this
decoupled approach, but we do not end-to-end train this model,
and we expect it will definitely benefit from end-to-end training.
And also, on the video side, how to really scale up the data engine:
we definitely need AI annotators for video. We tried that, but, yeah,
I think that's something that is definitely worthwhile to do.
The third one, we also discussed, is how perception fits into
this big AI landscape.
Now we have the eye; how does the eye
work with the brain to do
useful, real reasoning tasks?
Not only output segmentation, but
really answer how many cases
are here, or even answer the question,
okay, I have an example
from biology labs.
The robot needs to decide whether
the liquid in the test tube
is at the correct level or not.
You can see that this
involves perception, but also
involves reasoning. How to solve this more visual-reasoning kind of task with
SAM is a very big direction.
On the robotics topic, it was exciting to hear from like several friends that work at,
you know, different robotics companies on how they're like immediately starting to use Sam
three. And I think especially for the video use case, I think robotics is probably one of the
domains where I think improving video performance will have a lot of impact. And so I think,
Yeah, that's definitely an area that we could improve on further.
But, yeah, to Pengchuan's point,
I think there's still another step change to be achieved on video PCS.
Yeah, just a quick comment on the robotics things.
We're interviewing a bunch of robotics folks here,
as well as, like, Fei-Fei, who obviously started ImageNet.
A lot of people are betting on explicit world models,
and Sam is not, for better or worse.
And I wonder when that crossover might happen.
That's an open question if you guys want to take any world
models discussions.
Re where things are going based on community questions:
similar to how Nikhila mentioned, after Sam 1, the almost obvious thing that people
wanted was like open concepts prompting because people are like, great, this model can see
things, but I want to tell it what I wanted to see. And now with the introduction of Sam 3,
you have this stepwise component, which feels like a key component of, you know, the ChatGPT
era for vision arriving as a result. What's going to happen is now
you've provided people with an open text box and media. And so you're going to get all sorts of queries
from people that maybe the model isn't primed to be able to perform particularly well on yet. For example,
earlier we were talking about document understanding and document reasoning being a place where there's
known improvements to be made. And so you'll have people that will probably prompt to try to OCR things,
or you'll have people that want to do work with spatial reasoning. Like give me the object to the left
of this other object or give me a sense of where things are in relation to one another, which is
critical for robotics like we're discussing because that's how you navigate throughout the real
world. You'll also have, I think, people will want action recognition and vision language action
models, VLAs. Like the same things that, where you have these tasks where people are used to
providing open text prompts and getting, here's the part of the scene where the player kicked the
ball or the tennis player made the serve, those are interesting for the purposes of how to
understand and synthesize visual inputs. And so now that you've kind of given this open text box for
media, there's going to be a flood of the types of things users are going to want to try to do,
some of which Sam is already going to be really well adapted to do, some of which not. And I think
that that's going to be, it's going to reveal itself of the types of things that are obvious.
One of the things that we wanted to discuss was like where to use Sam and discover how to build
with SAM. So in addition to the Meta team building a tremendous playground for being able to
interact with images and video and kind of apply effects for like a video emphasis.
I think one of the things that we're pretty excited about with Sam 3 is how much
it positively impacts each part of building a system for visual understanding.
So for example, the very first step of historically aggregating and collecting a data set
because you think that there's not a model that understands the slice of the world that you want
to understand is where automating away, lots of labeling can exist.
Basically, if you collected a bunch of data of something that is already in SAM 3's knowledge,
then you can prompt SAM 3 to automatically label all that data for you.
And so we've actually made a bet on SAM 3 being a core part of Auto Label at Roboflow,
giving users a first pass of saying, hey, if you have a new image or you have a new video,
start providing just a text prompt and allow SAM3 to find and automatically label those regions of interest for you.
Downstream, I think there's areas for fine-tuning.
Like, you know, within a week of releasing SAM 3, MedSAM3 came out for adapting SAM into
medical contexts.
And I think that's a harbinger of what's to come.
Like, there will be lots of domain-specific adaptations of Sam in places where maybe there's
a specific ontology that someone wants to understand, or maybe there's a place where just the model
doesn't have great awareness yet.
And I think we're already beginning to see that with hundreds of fine tunes that users are
creating for various domains.
And then the last area is like, okay, I've got my model, now I want to use it.
And so one of the things that we're really proud of is to be ready on launch day,
and to showcase the infrastructure we've built to burst and scale, like, infinitely large as folks
have models that they want to deploy and make it readily available.
Having an endpoint that serves either a fine-tuned model or a model as is,
or even a model that might be able to run on edge hardware as smaller models come out
or maybe distillation comes to rise is, I think also an awesome place of where we're seeing
Sam 3 being impactful at each part of like the computer vision lifecycle and pipeline.
That's awesome. Yeah, I think especially the impact on speeding up annotation,
I think we've seen that consistently on Roboflow.
And I'm really curious to see how Sam through the introduction of Sam 3 really helps speed up that process even further.
I mean, just from playing around with it, it's so much faster than having to manually annotate every single object.
So, yeah, really curious to see how that improves the experience.
One of the things that we were pretty excited about is we were kind of able to build an entirely new product in the world of Sam 3.
And we called it Rapid, but basically it's like there's probably a model that already understands the objects in the world that you want to see.
So here I'm screen sharing an example of, like, these are vehicles next to our office in San Francisco that go by.
And you can see here's a Waymo and here's like other vehicles.
And like if I just have like this 10 second clip and let's say, you know, the first thing I want to do maybe is just like
count cars and I want to get a sense of each of the vehicles. What's really awesome is I can just,
you know, of course, text prompt and say I want vehicle. And as I toggle through different frames
in my video, Sam 3 already recognizes and understands those objects. Now, one thing that I think is
really interesting, there was a conversation earlier about how much you want to rely on a model
versus human's output of the model for what you care about. So, for example, let's pretend in
this scene, maybe the only cars that we care about are the ones that are like before the crosswalk
and maybe not far in the distance, then you'd get people that would say, hey, you know what,
I actually want the objects that are like most confident. And I would like, you know,
move my slider down to like getting a fewer number of objects. Whereas maybe others might say,
hey, I want like every single presence of a potential object in the scene, which even gets like
reflections on the building of objects. As computer vision approaches this world where we increasingly
have like models that can understand and improve themselves and we rely on what human output
and human preference from the models is we're going to get these funny scenarios where things
aren't all like immediately deterministic of what a human cares about and I think that's where like
tooling fills a big gap, but it also is going to be a place where it'll be really interesting
to see where users kind of start to use and apply the models, and why you need this last
mile work to put the model in context in the domain that someone is trying to solve and tackle.
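The slider described above amounts to a confidence threshold over the model's detections. A minimal sketch, assuming each detection carries a label and a score in [0, 1]; the `Detection` class, the function name, and the scores are hypothetical and are not taken from SAM 3 or Roboflow:

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float  # model confidence in [0, 1] (hypothetical values below)


def filter_by_confidence(detections, threshold):
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d.score >= threshold]


# Toy scene: two nearby cars, one distant car, one reflection on a building.
detections = [
    Detection("vehicle", 0.95),
    Detection("vehicle", 0.80),
    Detection("vehicle", 0.40),
    Detection("vehicle", 0.15),
]

strict = filter_by_confidence(detections, 0.7)      # "most confident" setting
permissive = filter_by_confidence(detections, 0.1)  # "every potential object" setting
print(len(strict), len(permissive))
```

Which setting is right depends on what the user intends, which is the point being made: the threshold alone does not say whether a reflection should count.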
So let me, since you're here, right?
This is one of those things where I'm like, I'm not sure this concept, the concept
of labeling concepts, can scale, only because I don't know if this slider between
less and more is the way,
if ultimately I need to tell you
whether or not to include
reflections, right?
Because in reflections, sometimes it's great.
That's exactly what I want.
Most of the time it's not going to be what I want.
I don't know if some RLHF thing
is going to solve any of that
because you just need more prompting.
Just saying vehicle is not going to do it.
Yeah, I don't know.
Feel free to disagree.
You can imagine such a type of thing coming,
for example, as kind of swyx said, that maybe
the reflection is exactly what I want,
then you need some kind of iterations with the interface or the model
or to get finally what you need.
So you need to specify the concepts kind of more clearly
through multiple iterations.
Can human not be involved in this iteration,
but just kind of models just kind of do it automatically?
I think that's kind of something,
definitely kind of, it's,
I would say, quite interesting,
that you can imagine this workflow,
and I want, kind of,
reflections, and then I can
kind of, with the default,
kind of threshold, maybe the model
will get an output.
Then another kind of very strong
perception model on other kind of,
like Gemini,
we'll then kind of ask,
we ask Gemini,
whether there's kind of some reflection there.
And it says, yes.
then you can see that we can automatically
kind of move the threshold,
kind of lower, and we're going to ask
again to see whether the reflections are
included or not.
So somehow this process can possibly should be done
completely with AI,
going to unless, yeah, yeah, exactly.
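One reading of the workflow just described is a loop: filter detections at a threshold, ask a second model whether the result matches the intent, and lower the threshold if it does not. The sketch below is illustrative only. The `verifier` callable stands in for a call to a multimodal model such as Gemini, and every name and number here is an assumption rather than anything from SAM 3.

```python
from typing import Callable, List, Optional, Tuple

Detection = Tuple[str, float]  # (description, confidence); toy stand-in for a mask


def lower_until_satisfied(
    detections: List[Detection],
    verifier: Callable[[List[Detection]], bool],
    start: float = 0.5,
    step: float = 0.1,
    floor: float = 0.0,
) -> Optional[Tuple[float, List[Detection]]]:
    """Lower the threshold step by step until the verifier accepts the output."""
    threshold = start
    while threshold >= floor - 1e-9:
        kept = [d for d in detections if d[1] >= threshold]
        if verifier(kept):
            return threshold, kept
        # Round to avoid floating-point drift across repeated subtraction.
        threshold = round(threshold - step, 10)
    return None  # no threshold satisfied the verifier


def wants_reflections(kept: List[Detection]) -> bool:
    # Hypothetical verifier: in practice this would be a question to another model.
    return any("reflection" in name for name, _ in kept)


scene = [("car", 0.9), ("car", 0.7), ("reflection of car", 0.25)]
result = lower_until_satisfied(scene, wants_reflections)
print(result)
```

In this toy scene the loop stops once the threshold drops far enough to admit the low-confidence reflection, with no human moving the slider.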
So for now, the answer is image.
And we can sort of tie it closer.
I think Joseph is showing us the sort of
Waymo annotation. Yeah, it's nice. Now you have a Waymo model.
Yeah, I was just doing an example where maybe we want to find an object that's not already
represented in the training data.
I think, I think prompting can solve, yeah, I think prompting could solve the problem
of, like, reflections, because maybe you could say, like, vehicles on the street.
But to your point, like, you would have to, like, see that that's a failure case, right?
Like, if I was, like, just setting up a camera and saying count cars, I wouldn't anticipate
realizing that reflection could be a problem.
And so I think this is why, like, in some ways,
human in the loop, because identifying human intention,
not necessarily human knowledge is what's going to be important
for a lot of last mile use.
But yeah, I'm pretty excited about.
Yeah, maybe I want to echo kind of what Joseph said.
Actually, also my experience,
just different people have quite different kind of definition
of even a visual concept.
for example, for some kind of data set, even hand.
Some people would like to just kind of annotate the palm kind of pad as kind of their hand.
And some people will kind of include the arm, kind of also kind of as hand.
Then when we kind of first tested SAM 3 on some very kind of customized data set,
we found, okay, the performance is not that good.
And when we kind of finally look into kind of the kind of performance,
we found, okay, this is kind of just the user has a different definition or explanation of the concepts,
but both explanations are okay.
Then in this case, you can see that you
really need a human in the loop
to do the kind of few-shot fine-tuning
or to adapt to the user's definition of this concept.
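To see why a definition mismatch like the "hand" example shows up as poor performance, consider intersection over union (IoU), a common overlap score for masks. This is a toy sketch with made-up pixel regions, not the metric or data used in the SAM 3 evaluation:

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    a, b = set(mask_a), set(mask_b)
    union = a | b
    if not union:
        return 1.0  # two empty masks agree trivially
    return len(a & b) / len(union)


# Hypothetical annotations of the same hand under two definitions.
palm = {(r, c) for r in range(0, 10) for c in range(0, 10)}      # 100 pixels
forearm = {(r, c) for r in range(10, 40) for c in range(0, 10)}  # 300 pixels
hand_with_arm = palm | forearm                                    # 400 pixels

score = iou(palm, hand_with_arm)
print(score)  # 0.25
```

Both masks are reasonable answers to "hand", yet a model that predicts one against ground truth drawn the other way scores only 0.25 here, which is why adapting to the user's definition matters.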
That's exactly right.
It's not always like deterministic of what someone really wants,
which is why I think like,
even if you have a fully comprehensive omniscient model,
putting the model into the context of what the user's trying to do
is where a lot of tooling and infrastructure becomes
really, really helpful.
Anyway, I found our Waymos.
You continue to build
excellent tooling for vision, and I think
the world is very grateful
for that. Let's get to
call to action. I think
we've sort of given a good
overview, and people obviously should read the paper
and try out the playground, try out Roboflow.
If they're interested in diving deeper,
what is the call to action from
each of you?
I mean, try the demo,
try the code.
We've got a lot of resources on the GitHub repo.
It's a very well managed launch, by the way.
Kudos.
I don't know.
It probably takes a lot of effort just on the launch itself,
even after the model's done.
Yeah, and actually just on that,
maybe one thing,
just shout out to the whole team.
I think this is,
um,
SAM 3 was our biggest and most ambitious project to date.
And it really took a huge team of scientists,
engineers,
interns, software engineers,
you know,
across,
across the company. So, you know, really huge shout out to the entire team that made not just
the model successful, but also the demo and then all the launch and everything. So it was a huge team
effort. Definitely, like, would love to hear from people on what you're using the models for,
where it's failing, you know, raise GitHub issues, messages on Twitter. We'd love to hear from you
on where we should go next as well. Yeah. And on top of that, definitely kind of try out also our
benchmark, the SACO benchmark.
I would say that it's likely that the benchmark will last longer than our SAM 3 model.
Maybe next year there will be a stronger model, but the benchmark is the one that I hope to guide
the community to get better and better models, kind of to get to human performance. We measured human
performance on the benchmark. I think maybe we are the first ones to do that for this kind of
segmentation and video grounding task.
It's very difficult to measure human performance on this task.
Hopefully, this benchmark can guide the community to achieve human performance for this task
and even kind of surpass human performance there.
We set out to be one of the best places, if not the best place,
to build with SAM 3 and the SAM family models.
So we're going to see what people build with SAM and computer vision models to move the whole field forward.
We have infrastructure for everything from deploying SAM 3 zero-shot to making your own fine-tunes,
to automating labeling of data with SAM.
And we continue to see the impact with each subsequent release expand the number of use cases and the amount of use and accelerate the time to value.
So excited to see what folks can build on Roboflow with SAM.
Thank you all so much.
This is a really great company.
It's great work.
And it just obviously always expands my mind as to what is possible with machine learning.
Yeah, I mean, we're not at ASI yet or AGI yet.
But every day we're getting closer.
Awesome.
Thank you so much.
Thank you.
Thank you.
Thank you.
