Embedded - 253: We’ll Pay Them in Fun
Episode Date: July 13, 2018

We spoke with Kathleen Tuite (@kaflurbaleen) about augmented reality, computer vision, games with a purpose, and meetups. Kathleen's personal site (filled with many interesting projects we didn't talk about) is SuperFireTruck.com. Kathleen works for GrokStyle, a company that lets you find furniture you like based on what you see. GrokStyle is used in the augmented reality try-it-at-home IKEA Place app.

A Theory of Fun for Game Design by Raph Koster

Flow: The Psychology of Optimal Experience by Mihaly Csikszentmihalyi

Language translating/learning app and online game: Duolingo

TensorFlow in JavaScript

HCOMP 2018: Human Computation conference with keynote by Zooniverse's Lucy Fortson (no video for that yet but we hope)
Transcript
Hello, this is Embedded.
I am Elecia White, here with Christopher White,
and this week, here also with Kathleen Tuite.
We're going to dive into computer vision, augmented reality games, and meetups.
Hi, Kathleen. Thanks for joining us.
Hello, I'm happy to be here.
Could you tell us about yourself as though we met at a technical conference?
Okay, my name is Kathleen Tuite, as you just said, and I am currently a software engineer at a computer vision AI company called GrokStyle. My background, from this current job and past projects, involves computer vision, game design, crowdsourcing, human-computer interaction, all those things kind of wrapped up together. And what I really like doing in general is taking interesting computer vision systems and building interactive things around them that people can actually use and play with.
Yes, we have so much to talk about.
First, we have Lightning Round. You've heard the show, so you know
the goal is fast and snappy.
Do you want to start? Okay, Christopher is shaking his head, which goes over well in podcast land.
Minecraft or Pokémon Go?
I like both of them a lot.
Favorite OpenCV function?
I don't like OpenCV as much... Favorite OpenCV function... One thing I do like about OpenCV: I like just reading in an image, and then also displaying it.
So the ones that read images and the ones that show stuff, those are probably my two favorites, just so I know what's going on and that I can get started and build something on top of that.
Favorite computer vision library, then?
So when I was a grad student, I did a lot of stuff with this tool, this system called Bundler, which is a structure-from-motion pipeline. So I'm pretty fond... I have a love-hate relationship with Bundler. And now there's a Python version of it called OpenSfM that is run by this company called Mapillary.
OpenSfM. Okay.
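For anyone following along at home, those two favorites are roughly this pair of OpenCV calls; a minimal sketch, with a placeholder file name:

```python
import cv2

# Read an image from disk ("windowsill.jpg" is a placeholder name)
img = cv2.imread("windowsill.jpg")
if img is None:
    raise FileNotFoundError("couldn't read windowsill.jpg")

# Display it in a window and wait for a keypress before closing
cv2.imshow("what the computer sees", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
```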
Is it my turn or yours?
It's my turn. Oh, go, go, go.
Favorite VR game?
I went to a capstone demo, the Sammy Showcase at UC Santa Cruz, with a bunch of games that students have been working on this past year.
And I played this ghost cat VR game where you're a little ghost cat, and you have another buddy ghost cat, and you have to stay near each other, because they're your source of illumination.
And then you have to jump around and get to other places.
And so, that ghost cat VR game.
That no one else can get?
No one else... I mean, they're trying to make a research platform out of it, so maybe you can play it soon. I don't know. That's the only recent thing that I can think of.
A tip you think everyone should know?
Stop writing bugs. Just stop it.
Okay, so a long time ago I was programming stuff with my husband, my boyfriend at the time, and I was just kind of being sloppy with what I was doing. And I'd write some code and then I'd run it.
And then I'd, like, read the error and, like, oh, I spelled that thing wrong.
And it was just, like, this really slow process.
And he was, like, just stop writing these bugs.
I was, like, okay, I'm going to try.
I'm going to be more mindful about this and just, like, go a little bit slower.
And think, like, I'm a human.
I can do this as best I can and, like, try to just not write the bugs in the first place.
And then whatever errors do come up,
they're still there for me to figure out,
but the really basic ones, I can kind of just try not to do that.
Yes, I know what you mean.
That's what monkey coding is for me, when I do it,
where I'm just like, oh, I'm just going to type at this until it works.
Yeah.
And I'm not going to sit there and think about how it should work and how I can get from where it doesn't work to where it does work.
I'm just going to keep incrementing this variable until the timeout is the right length.
Yeah, yeah, just like stumbling through it.
And that works sometimes.
Thinking about it actually is kind of better.
Actually better in so many ways.
Stop writing bugs. That's great advice.
Computer vision.
When people say computer vision, what do they mean?
So I would say computer vision is the ability for a computer that's gotten some sensor, some picture of the real world...
Maybe it's a picture from a normal RGB camera. Maybe it's a depth sensor, some more enhanced picture. The computer's ability to make sense of that and understand what is going on in that scene, whether it's recognizing the objects in the scene, or the activity that's happening, or just more information about what the scene really represents. A facial expression that someone has, or who the identity of a person is. Those are all computer vision things: the ability to understand things about the real world from a picture, or a movie, or something kind of like a picture, like a depth image.
I like the way you put it, because it is about the computer, not just acquiring the data,
but being able to do something.
You said understand, which computers don't usually do, but it's that level.
It's an intelligence.
Yeah, to make sense of it enough at some, whatever understanding level is possible,
that then you can actually use that in some other system.
You did graduate research.
And I know I'm not going to get the word right.
Photogrammetry?
Photogrammetry.
Photogrammetry.
I missed the R.
Okay.
What is that?
So photogrammetry is the ability to get a bunch of images of one thing all from different angles and kind of come up with
the 3D structure of the item that all the cameras are looking at and also the pose of each of the
cameras, how they relate to one another and to the object itself. And so if I go and I take
a picture of the Eiffel Tower and then I take another picture of the Eiffel Tower, and then I take another picture of the Eiffel Tower, you can build the Eiffel Tower in 3D from my photos?
Yeah, yeah.
But two isn't enough?
Two pictures can get you started, if they're close enough that they're seeing roughly the same view, but they're also spread out enough that they're not exactly on top of each other. And you can get some kind of 3D information from them being split apart. You could know where the pictures are, where the Eiffel Tower is. But if you want to go further and get a fuller 3D model of the Eiffel Tower, you'd want many pictures of it from many different angles.
And that might be enough to fill in the actual structure of that object. Although the Eiffel Tower has a lot of cross braces and things where you can see through it, and that will probably be a little bit challenging for the computer to make sense of.
How about the Washington Monument?
The Washington Monument... that doesn't have enough detail, right? And it kind of looks the same from all four sides, the tall one. The canonical examples of this, where structure-from-motion photogrammetry sort of became a thing that other people started really running with, was this Photo Tourism project at UW, where they took a bunch of photos from Flickr of popular tourist places like Notre Dame Cathedral and Trevi Fountain in Rome, and used those photos. So those are places where there's enough texture and structure, but it's kind of this continuous surface that you can't see through, unlike the Eiffel Tower.
So it's better if you have something that has a lot of detail,
but not see-through and not repeating detail.
Right.
This is a lot of caveats.
It definitely is, yeah.
Okay.
And then, so the way I think it happens, like my mental image is you have picture one,
and then you take picture two,
and you try to map up all of the same points.
And then you take picture three, and you map up all the same points on picture one and picture two.
But then I'm kind of lost.
I mean, I know you can use convolution to map up points that are the same,
but what happens after that?
Is that even right?
That is totally the first step: getting two images, or more images, or pairs of images in a whole big collection of images, and figuring out what all the interesting points in these images are. Like, this point was seen in these ten images, and it was at these pixel coordinates of these images over here. And you just have a whole bunch of data of those correspondences.
And then you throw it into something called bundle adjustment, and that will figure out the 3D positioning of where all those points should be in 3D space, and where the cameras should be, like what pose they should have, based on all these camera pinhole math equations.
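For reference, the "camera pinhole math" here is the standard projection model (textbook form, nothing specific to Bundler): a homogeneous 3D point $\mathbf{X}$ lands at pixel $\mathbf{x}$ via

$$\mathbf{x} \sim K\,[R \mid \mathbf{t}]\,\mathbf{X}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $R$ and $\mathbf{t}$ are the camera's pose and $K$ holds its internal parameters (focal lengths and principal point). Bundle adjustment tweaks all the $R$, $\mathbf{t}$, and $\mathbf{X}$ values at once so these projections line up with the observed pixels.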
Okay, we're going to ask you about that too.
So don't get comfortable with me skipping that.
But even that first step,
are you using the RGB images or are you trying to find vertices? What kind of algorithms do you use to even find the points? And then what does this bundle thing do?
So the algorithm to find the points initially... SIFT is a good one. And I know, I think your typing robot uses the same SIFT feature points to figure some stuff out.
It does, it does.
But when I did it, I just used OpenCV, and it magically worked.
I have no idea what the algorithm was. That was part of when I was trying to figure out where the keys were.
And I had a perfect image of a keyboard.
And then I had my current camera image of the keyboard.
And it was SIFT and FLANN and homography.
And I just typed them in.
And wow, it just found them.
And I did nothing.
Even when I changed the lighting, it was pretty good.
So what does it do?
So to break it down a bit more,
SIFT stands for Scale Invariant Feature Transform.
Yeah, transform sounds good.
And basically, for a computer to start understanding an image, it looks for distinctive points. Say there's a building, and there's a windowsill, and the part where the outside of the windowsill comes together at an angle is on top of a brick facade or something, so that the sill and the brick are different colors and the light is casting shadows in a certain way. That particular corner of that building might have, or will have, a distinctive look. And the SIFT feature of that particular point would capture something about the colors there, but more importantly the edges: what angle they're at, and how strong, how edgy, how cornery they are. And the scale-invariant part of SIFT means that if you have a picture of that windowsill up close, and you have another one that's maybe far away and maybe rotated a little bit, that particular piece of those two images will still look very similar. It will have a descriptive way that the computer can represent it, so that it can tell that they're the same point.
Okay, okay. So now we found all of these point correspondences.
Correspondences, yeah. I mean, they start out as just feature points, points of interest. These are little corners or things that a computer can say: I know what that is, I know where it is. Versus on a plain blank wall, there's nothing special about a pixel in the middle of that space; it could be anywhere. And then when you have multiple images, like two images that both have SIFT points, and you kind of figure out the correspondence between them, that's when the correspondence part comes in.
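Sketching that in OpenCV, since it comes up again with the typing robot later: detect SIFT points in two images, match them with FLANN, and estimate a homography from the good matches. The file names and the 0.7 ratio threshold are conventional placeholder choices, not anything from the episode.

```python
import cv2
import numpy as np

img1 = cv2.imread("eiffel_1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("eiffel_2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)  # keypoints + 128-dim descriptors
kp2, desc2 = sift.detectAndCompute(img2, None)

# FLANN does fast approximate nearest-neighbor matching on the descriptors
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
matches = flann.knnMatch(desc1, desc2, k=2)

# Lowe's ratio test: keep a match only if it's clearly better than the runner-up
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC-estimated homography: the matrix mapping one view onto the other
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```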
And so I can sort of understand with two images, because that's kind of how my eyes work. It's 3D vision.
And if my eyes were further apart,
if I had a really big head, I would be able to see 3D vision
further away. But right now, after about 10 feet, everything's kind of
flat.
I know there's actual math that would tell me how far it is.
But realistically, I'm pretty cross-eyed.
So 10 feet is really about it for me.
Don't play basketball with me.
And so when you have two photos taken apart, far apart, then you can get more depth.
Yes.
But my eyes work because they're always in the same place.
They always have the same distance between them.
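The geometry behind that intuition is the standard stereo relation (textbook math, not from the episode): for two views separated by baseline $B$ with focal length $f$, a point whose image shifts by disparity $d$ between the views sits at depth

$$Z = \frac{fB}{d}$$

so a wider baseline produces measurable disparity, and therefore usable depth, at longer ranges.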
It seems like a chicken and an egg problem that you can find these points and you can find the 3D-ness of it, but you also find where they are.
How do you... which one's the chicken and which one's the egg, and which one comes first?
So you're totally right that our eyes, where our brains have calibrated the fact that these eyes that we have are always in the same relative position to one another... And I think 3D reconstruction techniques from two images have existed for a while, and they started out with: we need to calibrate these two cameras relative to each other first. Like, they're going to be mounted on some piece of hardware and they're never going to change. And if some intern bonks them, then they have to go recalibrate the whole thing.
Yeah. Yeah, I remember doing that, yeah.
And they have these, like, calibration checkerboards that you can set up.
And there's probably some OpenCV function for, like, look at this checkerboard and figure it out.
Like, figure out what the camera is.
There totally is.
Yeah.
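That checkerboard routine looks roughly like this in OpenCV (board dimensions and the file pattern are assumptions for the sketch):

```python
import glob

import cv2
import numpy as np

pattern = (9, 6)  # inner corners per checkerboard row and column (assumed)

# Ideal 3D corner positions on the flat board, in board-square units
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points, img_size = [], [], None
for path in glob.glob("calib_*.jpg"):  # placeholder file pattern
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Solves for the intrinsic matrix K and the lens-distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img_size, None, None)
```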
So, getting from two cameras where you've calibrated them already, and also you have to calibrate like the internal, like lens distortion and all of that of a camera.
And that's where the checkerboards come in.
But having more cameras, yeah, you need to figure out what the 3D structure of the points I'm looking at is, which will help me figure out where the cameras are. And you also need to figure out where the cameras are to figure out where the 3D points that you're looking at are. And what this bundle adjustment technique is... well, I guess you had mentioned homography, or alluded to it. Homography is like an initialization step: there's two cameras and they're looking at the same thing.
And if that thing is like a planar surface, it's kind of understanding the relationship between those two cameras.
Yes, in my typing robot, I have the keyboard, the perfect keyboard. And then I have my scene of however I put the camera up today. And then the homography: I take a picture, and it maps the escape key onto the escape key and the space key onto the space key. And then it gives me the matrix that I can use to transform from my perfect-keyboard world to my current-image world. And so that matrix I can then just use to transfer coordinates between them.
Right. So if you have all these pictures that tourists took of the Eiffel Tower, you can look at the pairs of cameras, and look at the SIFT correspondence points that you found between them, and kind of estimate a homography. Like, what is that matrix that says how this one camera moves to become the other camera? And it might not be perfect, because of the points that you're looking at in the world, or there's maybe stuff you don't have enough information about yet; you don't know what the internal camera parameters are for that particular camera. But you can get some initial guess. And then what bundle adjustment does is take all of
these initial guesses of how all these cameras and points and tracks of points seen by multiple cameras fit together,
and it kind of comes up with an optimization
that solves for both of those things at the same time.
So it takes all of the correspondence points for each pair,
and then it minimizes the error for all of them.
Yeah.
And so if you end up with a bogus pair,
like on my keyboard if I was mapping A to Q,
if I took a bunch of pictures, it would eventually toss that one because nobody else agreed with it.
Yeah, it might toss it.
Or it might be like, I think this is right, and it might just be wrong.
And then it skews everything.
Yeah.
So in this project that I worked on in grad school called Photo City,
which was a game for having people take lots of photos of buildings and make 3D models,
I saw a lot of this 3D reconstruction stuff gone wrong, where a person would take photos, and the wall of the building would grow, but then it would just curve off into the ground. Or the model would just totally flip out and fall apart, because this bundle adjustment, this effort to kind of figure out cleanly where everything goes, would just get really confused. Or sometimes there would be itsy-bitsy, teeny-tiny, upside-down versions of a model that were really close, because the computer was like, this makes sense, to make a tiny version of this building here. It kind of looks the same as having one that's really far away.
Yeah, I mean, you get a discoloration in a building that has bricks, and then you end up with the small discoloration of the bricks, and it can't tell the difference, because scale invariance.
Yeah.
Computers, man.
They mess up sometimes.
When you do the minimization problem
of finding all the matrices,
which gives you the 3D aspect, that's when you can start figuring out where the people are.
Because you can backtrack.
Once you're confident that these points are in this space, you can backtrack to where the camera person must have been.
It's doing both at the same time and kind of going back and forth between optimizing where
the points are and optimizing where the cameras or the people holding the cameras must have been.
And you can say, I have a pretty good guess of where the 3D points out in the world are.
But if I wiggle the cameras around a little bit, then we'll come up with a better configuration
that minimizes that error even more
And the error that we're trying to minimize is: do these points in the world project back onto the right pixel coordinates of the image, or are they off? We're trying to sort of get everything to make sense across all these different pictures.
And in the end, this is a massive linear algebra problem.
Yeah, pretty much.
That's weird.
I mean, it sounds like you put photos in,
you get locations and 3D out,
and so it sounds so smart,
but in the end, it's just massive amounts
of A plus B, X plus C, Y.
Yeah, yeah. It's totally like magic that this is possible, but it's also totally not magic.
It's just like just a bunch of math.
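That "bunch of math" is, at its core, nonlinear least squares on reprojection error. A toy sketch of the objective (real systems like Bundler exploit the sparsity of the problem; this dense version, with a single shared focal length assumed, is just to show the shape of it):

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def project(point3d, rvec, tvec, f):
    """Pinhole projection: rotate/translate into the camera frame, divide by depth."""
    R, _ = cv2.Rodrigues(rvec)
    cam = R @ point3d + tvec
    return f * cam[:2] / cam[2]

def residuals(params, n_cams, n_pts, f, cam_idx, pt_idx, observed):
    """Reprojection error for every (camera, point) observation."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)  # rvec (3) + tvec (3) per camera
    pts = params[n_cams * 6:].reshape(n_pts, 3)    # one xyz per 3D point
    proj = np.array([
        project(pts[j], cams[i, :3], cams[i, 3:], f)
        for i, j in zip(cam_idx, pt_idx)
    ])
    return (proj - observed).ravel()  # bundle adjustment drives this toward zero

# least_squares wiggles all the cameras and all the points together:
# result = least_squares(residuals, initial_guess,
#                        args=(n_cams, n_pts, f, cam_idx, pt_idx, observed))
```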
It used to be whenever we were doing computer vision stuff or machine vision or whatever we were calling it, there was the requirement that things be lit very brightly.
That went away.
Why did that go away?
How did that go away?
That was like the core thing with object identification and location.
When did lighting stop mattering?
Or does it still and I'm just using better tools?
There can be a number of things involved.
Lighting still matters, but SIFT is pretty good at matching things even when the lighting is a bit different.
Another big thing might be that the quality of cameras that we have is better now.
Like the webcam that you have or the camera on your phone or the camera that's built into your
laptop, those can, they can work better in like lower lighting, crappy lighting. They will also
just take clearer pictures. So I imagine that it was more critical in the past
because having like the cameras just like couldn't see very well. And so you really had to
make it easy for the cameras. And then a third aspect is that we have a bunch of data online
taken by cameras. And so there's a lot more that we can do with, say, crappy cameras, not-very-good cameras. We can learn more from all of this data that's available, so we can kind of compensate for the fact that the lighting might not be as good, because we've seen enough examples of something with not-very-good lighting that we can still understand what it's supposed to be.
It's interesting that it's the camera technology
that is one of the drivers.
I hadn't really...
It's probably the application, too,
because if you're doing a manufacturing thing,
you want everything to be exactly the same all the time.
So, okay, we have good lighting and we know the lux and everything.
Every time, just know the circumstances don't change.
Whereas for a more general vision application,
you might be taking pictures anywhere.
And so you have to be able to adapt.
Yeah.
If you don't have to be able to adapt, then it's easier, right?
Yeah, yeah.
Like, because this technology of taking a picture and adding to a model, or taking a picture and recognizing some object in it, is working well enough, those are getting into the hands of consumers. You're totally right that now people want to use that in a wider variety of applications. So it's kind of pushing the limits: we need to work on making this better, we need to work on making it still figure out what it's doing, even if it's some random person taking a picture in their dark living room.
And I think that has gone back to the manufacturing areas, that even there you don't need the bright lights, because we've learned to adjust to people taking pictures. It's cheaper not to have to do that.
You can use consumer-level stuff, yeah.
Yeah.
Okay, so at the end of taking a bunch of pictures,
you get a bunch of points on your Notre Dame
or your Eiffel Tower,
although we agreed that was kind of iffy.
And then you get the location of the people.
Which one is more important, and what do you do with it then?
I mean, part of me is like, oh, this is a surveillance thing.
I should never take another photo in my life.
The locations of the cameras... there's probably more information there, because you can understand where the people were who were taking these pictures, where they were standing, where people can go. The points themselves, there might not be enough of them to really do something. Like the points on Notre Dame, or the points on the Eiffel Tower: it's kind of like, okay, now we have a crummy point cloud of this place, and we could just get our 3D model of that object another way. But then, to know where all the humans were standing... There's a project that was a follow-up to this Photo Tourism project, of looking at where people walk in the world when they're taking pictures of things. And they made a little map of people walking into the Pantheon, and where most people took photos. And you could see that you'd walk in and kind of go to the right, and lots of people would take photos right when they got in, of the ceiling and other stuff, and then they'd walk around, and the amount of photos that they took kind of trailed off collectively, because people just got it out of the way at the beginning. And I went to the Pantheon in Rome, and I was like, I've never been in this building, but I know what to expect, where people are going to flow in this space and where everyone's going to be taking pictures. And sure enough, you go inside and you're routed around to the right, in like a counterclockwise direction, and all these tourists are pointing at the ceiling in the beginning and not so much at the end.
Museums could use this to figure out which artworks are getting the most attention.
I mean, I guess just the number of pictures taken of each artwork, but where people stand,
there are a lot of times where how the crowd moves is an interesting problem.
But that was not what I asked you there for.
Now I totally want to talk about that.
Building the 3D models.
That was what you were doing.
You were taking the point clouds and making 3D models, right?
Yeah.
I mean, I was building this game,
this crowdsourcing platform
around this structure for motion system
where people could be empowered to go take pictures of wherever
and make 3D models of wherever.
So in some sense, it was about getting the 3D models,
but it was also about just like,
how do we get an entirely new kind of data that doesn't exist online already?
But that data does exist online.
Not really. Like, we have a bunch of pictures of the front of all these fancy tourist buildings, but we don't have enough around the side. People aren't going to be walking down some alley taking a bunch of pictures on their vacation, unless they're playing Photo City, or they're doing some other crowdsourced street view thing, like Mapillary, which I mentioned before. But the data, it's not there. There's gaps in what people have taken just of their own accord and posted online.
this is something that i have heard you speak on some,
that the data we have for so many things is,
I mean, biased, even visually,
but biased in all kinds of ways with gaps.
And you want to gamify filling in the gaps.
Yes.
That's cool. Weird. Strange. Cool. How do you convince humans that they should help their robot overlords get more data and understand the world around them better?
So that there can be better applications built for humans to use in our daily lives.
Can you give me examples of gamification of this sort of thing?
Oh, there's like two tangents here. One part is about gamification, and one part is about how applications are built on data. AI applications: there's data out there, and then people try to use it, and it works for some things, but it doesn't work for other things. And there needs to be more data that directly relates to what a person is trying to do. And because there's some system, some human trying to do something, and an AI system isn't working for them, or it works sometimes, maybe that can turn into a fun game. Like, what is the computer good at knowing? What is it not good at knowing? How can I stump the computer?
So, an example of things that may not be called games, but they're kind of game-like: a couple of years ago, there was this How-Old robot, an age-guessing thing that Microsoft put out, where you uploaded a picture of your face, and it found the face in the image, and then it estimated an age for that. And people had a lot of fun with it, because it would either have some really accurate response
or it would have some really hilariously wrong response,
like, oh, this picture of Gandalf says he's like 99 years old,
like, ha ha ha, or this picture of me
like says I'm way younger than I actually am, how flattering,
or kind of funny things like that.
People found ways to play with it and figure out all its limitations and what its capabilities were.
And they kind of had this communication around it.
Last week we talked to Katie Malone about AI and one of the
things we talked about was fooling
the AI and the
Labrador puppies and
the chicken image.
The fried chicken.
Where the AI is confused
as to which things are dogs. And there's a whole
set of dogs or not.
Like chihuahuas that look like blueberry muffins.
I loved those.
Although when I told the chihuahua owner that their dog was a cute blueberry muffin, they
totally didn't get it.
Oh, man.
Yeah.
Okay.
So there's the fun aspect of making fun of the computer.
And also trying to help it along. Like, oh, I want to help teach you to do better. And if we can kind of elevate what computers are capable of, then there might be areas where we are suddenly more powerful, more capable, because now we have these better-trained tools at our disposal.
Okay, so there's the aspect of wanting to train, slash, one-up the AIs, and then there's straight-up gamification. That's where you compete with other people to provide the AI with more information.
Yes.
So there's a history of gamification, especially regarding data collection. There's a series of games, or there's a genre, called GWAPs, or games with a purpose.
GWAPs, really? That's how we're going to pronounce that?
Yeah, I thought it was gee-waps.
Gee-waps, okay. Games with a purpose. Okay.
And I actually... I've built games with a purpose, but I also am highly critical of games with a purpose and gamification.
And when it's done shallowly and when it's like, oh, we'll just sprinkle points and leaderboards and badges on top of something to try to get people to do this task for us for free.
We'll pay them in fun.
And sometimes it's not fun.
The game wasn't designed very well.
It doesn't make sense to be a game.
There's many cases where maybe you should just build some task on Mechanical Turk
and pay people fairly to do that task
instead of trying to go in this roundabout game way.
Okay, so you're ambivalent about gamification, and I totally understand that. What would make it be done well? I mean, what are the hallmarks of actual fun?
So, okay, there's a book by Raph Koster called A Theory of Fun for Game Design, and one of the ideas of that book is that learning is what makes games fun. There's some picture in the book, it has lots of pictures, it's got kittens rolling around, and it says the young of all species play. Kids and kittens and puppies are playing, but they're learning a ton as they're playing. And one thing that almost basically every game has is: you're learning the mechanics of that game, you're learning the rules, you're learning the system. And you start out not knowing that game, but that game will help you gain the skills that you need to do more interesting things in that game.
And this also fits into this theory of flow by this guy with a name that I can't pronounce.
It's like Chixiameth.
It has a lot of C's and Z's and H's and stuff in it, and I can look it up later.
But this idea that...
Wow, that is a lot of...
Mihaly?
Csikszentmihalyi.
Yeah, flow, the psychology of optimal experience.
Okay.
Sorry, go ahead.
Okay.
Yeah. I'm glad you all tried to pronounce that.
I didn't do any good job.
So, in a lot of more basic gamification, there might not be anything interesting that the person is learning, or there's not any skill that they're trying to practice or get better at. And I think that's when I get kind of suspicious and judgmental. I'm like, how is this fun if the person isn't learning something here? Maybe they're learning to game your weird gamification system instead of actually doing the task that you want them to do. So: having skill, having something that a person is learning over time, that they're getting better at, that they're interested in getting better at.
You're making me judge the games I play so hard right now.
Oh, games for games' sake are a different category, right?
Well, my
I have been playing a game on
my fitness thing
that now I'm judging
very badly.
I like
the idea of learning in games.
It makes sense
to me. I mean, when you think about Minecraft, that was all about learning.
Yeah.
It was all about learning the world and learning how the rules worked and even then learning more about how to make things in it that you wanted elsewhere. And as I think about some of the other even silly games I play,
like Threes, which I think is 2048 and other places,
but there are times when I'm still learning the rules
on this game that I have played for so long.
Because it's like, okay, I think right now this is what's going to happen.
And whether it does or doesn't.
Yeah.
Okay.
I totally get the learning.
Now, can they teach me useful things?
Yeah, totally.
Well, so one of the original games with a purpose was this game called, I'm almost going to call it Duolingo, but I'm getting to that.
It was called the ESP game, and it was a data collection game of two random people
on the internet are shown the same picture, and they can't talk to each other, but they have
to come up with the same words to describe that image. And if they match what the other person is
saying, then that becomes a label for that image. So two people will see a picture of like sheep in a green field and so
they'll type sheep green field sky clouds and some of them may type like or something yeah and
another person will be like well i didn't type butts because i wasn't thinking that i was thinking
of the sheep and so the ones that match up yeah like idyllic they will those will become the labels for that image and that uh that had this game mechanic of like am i gonna
figure out the words to describe this that another human will also come up with the same words um
yeah if you're sitting there identifying the sheep species in latin that may not be what the
other human does yeah you may not you may be
right but you may not be winning points exactly so so you won't go with those labels you'll find
the ones that are more common and shared um and this this game was by this guy luis van on and
because it was like making image labeling fun through a game,
it kicked off this whole series of other games with a purpose.
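The matching mechanic she describes is simple enough to sketch (a toy illustration of the idea, not the actual ESP Game code):

```python
def esp_round(labels_a, labels_b, taboo=()):
    """Return agreed-upon labels for an image from two players' independent guesses.

    Toy version of the ESP Game mechanic: a word becomes a label only when
    both players typed it, and previously earned "taboo" words don't count.
    """
    a = {w.strip().lower() for w in labels_a}
    b = {w.strip().lower() for w in labels_b}
    return (a & b) - set(taboo)

# Player one types "sheep green field sky"; player two types "sheep field clouds".
print(esp_round(["sheep", "green", "field", "sky"],
                ["sheep", "field", "clouds"]))   # -> {'sheep', 'field'}
```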
And then other people kind of... they didn't get the mechanics quite right. I don't know. Some things that came after, I just felt like they weren't good games. Like, the mechanics of the game didn't match whatever the purpose was trying to do.
You just can't throw points at people.
No, you have to give them more than that.
Yeah. At least a little bit more. I mean, points... sometimes they work enough that people keep trying it. They're like, oh, I do like to see my name on a leaderboard. But not everyone is like that. And there really needs to be something deeper, where the person, by playing the game, is actually contributing to whatever the underlying scientific or data cleanup purpose is. Otherwise they may just be racking up points but not actually helping you out.
It sounds like, to properly design a game, you actually need to have some psychological understanding
to know what motivates people.
And also, if you just do a naive thing, like you're saying with points,
you can end up with these holes, like you said,
where the game goes off in a different direction
and people figure out ways to game the system
and you don't get the data you want.
Yeah, exactly.
Like, having the mechanics aligned with the underlying purpose is super important. But you asked about, can I learn useful things from these games? And what Luis von Ahn is doing now, probably other things too, but one of his main things is this app called Duolingo, for learning new languages. It's not a straight-up game, but it has a lot of elements of a game, like ramping you up in a very gradual way. And the idea of Duolingo in the first place was: there's a bunch of text on the internet, we need to translate more of the internet, wouldn't it be great if we had that? And this was before automated translation techniques were good enough to use. So we need humans to do the translation, but maybe people aren't skilled in translating between, you know, obscure language one and obscure language two, or even English and some other language, not necessarily obscure, any pairs of languages. And so this idea of: maybe we can just teach people new languages, and then they can start to help translate stuff on the internet.
Yeah, I can totally see this working.
Because for me, it would be probably English and Spanish or English and French.
And you could give me an English phrase with an idiom in it,
and I would have to go figure out how to say that in Spanish
in a way that represented the idiom part of it, as well as maybe the words part of it.
And that would force me to go learn more Spanish, which is something I always want to do.
And it would help other people that if multiple people translated it similarly,
then you can start saying, oh, this is probably a reasonable translation.
Yeah, exactly. And then by being in this process where you're learning a little bit of new
skills and then applying them, you'll be able to translate more, more effectively,
and you'll just kind of grow and grow and grow in what you know and what you're able to do.
And even if you presented me with, these are five things other people said, which of these is right? You could do that, and I would play and learn and not care so much about just points. It would be about fun.
So now Duolingo is a free, sometimes ad-supported app that you can use to learn new languages.
And I don't know how much the translating stuff on the internet plays into it anymore, but it's this accessible language learning tool that seems really great.
Especially compared to like, pay $500 for Rosetta Stone or something.
Yeah, we don't need to talk about that.
I want to switch topics entirely
because you are part of this company that is weird and cool
and I have trouble explaining it because I get lost in AR and furniture.
And can you explain what GrokStyle is?
Yes, totally.
So GrokStyle is the company that I currently work for.
We do visual search for furniture and home decor and sort of expanding to AI for retail in general.
And what our core visual search technology does is allows you to take a picture of a piece of furniture, like some chair that you like at your friend's house, and identify what what it is. Or we can go beyond that to understand all of the products in designer showroom images and chair and this coffee table and this rug and these would all actually look nice together and
you don't have to worry about not having that stylistic judgment yourself if you don't actually
have that and that seems hard that does seem hard so there's it's just it's just math and data and linear algebra.
It's just math.
Okay, so I go to a friend's house, I take a picture, and their, I don't know, 15th century throne that I have taken a picture of, it then tries to find a similar throne that can be purchased now at some major retailer.
So it says, oh yeah, if you get this at Target, it's really similar.
And so you have to have a huge database of existing furniture. You're not just like, I'm taking this picture and then I'm going out to the internet and searching.
You have to already know a lot about furniture.
Right, yeah. We have our own huge internal database of photos of furniture, millions of products, millions of scenes of ways that people have used this product in the real world. And we have learned this understanding of visual style.
Some way for anyone that takes a new picture of something,
for us to project that into some style embedding
and look up what's nearby,
what products are similar to this thing.
So if I take a picture of a mission-style couch, which is a very specific style, you would be able to say, oh yeah, you might want a chair and this style of end table?
We're working on the recommendations part. For now, we have a mobile app where you take a picture of your mission-style couch and we'll find more of those.
More mission-style couches, for different prices, from different places.
Yeah.
And how do you identify mission style?
How do you identify the style of what you're looking at?
Is this part of finding terms, search terms?
We are...
Tags?
The core of this is visual understanding.
So just from tons and tons of images of couches of different styles, we'll identify, like, these are the ones that look closest to this one.
And then we can look at the associated metadata to see what the names of the nearby matches are, or what styles might be tagged on those already. But it starts from the visual path.
When we talked about SIFT, and how the Eiffel Tower isn't really a good candidate because it has holes and because it has repeats... chairs?
So in this case, we're just doing deep learning on tons and tons of images, and SIFT isn't involved. SIFT is a feature that a human would say, I'm going to use SIFT in this pipeline. And I've done some other computer vision stuff with faces, where I was like, we're going to match faces by comparing SIFT features across faces.
And I have to decide, like, I'm going to use SIFT, I'm going to look at these regions of the image, I've got to get all my faces lined up first.
But in this deep learning era, we can say, here's a bunch of images of all these things, and I'll tell you how they're similar and how they're different.
And the computer can figure out what features and what internal representations are most useful, most discriminative for its purposes.
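In spirit, and this is a generic deep-embedding retrieval sketch rather than GrokStyle's actual system, that lookup stage can be as simple as cosine similarity in the learned embedding space:

```python
import numpy as np

def find_similar(query_embedding, catalog_embeddings, k=5):
    """Nearest-neighbor lookup in a learned style-embedding space.

    query_embedding: (d,) vector from some trained network for the user's photo.
    catalog_embeddings: (n, d) matrix of precomputed product embeddings.
    Returns indices of the k most similar products by cosine similarity.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    c = catalog_embeddings / np.linalg.norm(catalog_embeddings, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity against every product
    return np.argsort(-scores)[:k]      # indices of the closest matches
```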
Does it have multiple stages?
Does it figure out it's a chair before it figures out what kind of chair?
And figure out chair versus couch versus table?
Our system does predict what category something is. So yeah, it'll say, I'm pretty sure this is a chair, so then it will go look up chairs, instead of looking across the entire database of everything that we have, because it would be more computationally optimal to say, okay, this is a chair, now let's go into the chair subcategory and finish looking up: is it a 1916 chair, a postmodern chair?
Right.
Another thing we can do, though, is we can say, you took a picture of this wicker chair, and we know it's a chair, but if we start looking for tables that are nearby instead, we might find, like, wicker, some other aspect that's stylistically similar but in a different category. So our learned style embedding does kind of cluster objects: even if they're in different categories but they're still visually, stylistically similar, it will kind of still put them together.
I should have asked you what your favorite machine
learning language was. Keras?
TensorFlow? Straight math?
We're using several of these machine learning libraries, rolling our own in certain cases, and using Python to strap it all together.
All right.
Wow.
IKEA.
Tell me about IKEA.
Okay.
So GrokStyle is this visual search service provider, and IKEA is one of our big public clients right now,
where they have an augmented reality app called IKEA Place.
And within that app, you can access a visual search feature,
and that is powered by us.
And so I go to my friend's house.
I see a chair I like.
I take a picture of it.
I say, you know what I want? I want this chair in my house.
So I go home and I go to the IKEA app, and then I say search, and it says your chair is something that
has weird letter O's, and then it just plops it into my...
Yeah, yeah. So you could be at your friend's house, use the IKEA Place app to search there, and say, I'm going to figure out what this chair is. And it'll be like, oh, this is the POÄNG chair, or something else that you might struggle to remember and type in later, especially with all the accents. And you can favorite it in the app from there, and then bring it home, and then place it into your home and see: oh, I like how this fits, I'm going to consider buying this.
Even though their chair may not be an IKEA chair.
Right. It's going to find whatever's similar, because that's what GrokStyle does.
Yeah, if you take a picture of that cool throne that they have, it'll find the closest IKEA throne-like item.
How does it deal with size? I mean, it's just one picture. There's no 3D. How do I know it isn't a six-foot-by-ten-foot chair, as opposed to a normal-size chair? Is that the future?
No. It will find...
If you take a picture of a chair and there also happens to be miniature versions of that chair,
we might still find the little mini one.
And Amazon sells tiny little...
That's right.
Dollhouse chair.
Yeah.
We can't tell if you're taking a picture of a dollhouse chair.
This is not what I was thinking.
This doesn't fit in my space at all
But once you're in AR, those models are all true scale, true to life. And with the current capabilities of AR, moving your phone around in your space and looking at what's in your space, that does estimate what size your space is and what the scale of everything is.
So that if you put like a three foot tall chair or something out there, it will actually be the appropriate size and you can measure things.
So the AR part is okay.
It's just that I can take a picture of a doll chair...
Yeah.
Or a giant chair, and it will find the most similar. But it will then be normal size, because the AR will show me what size.
Yeah. And I do have a little IKEA chair on my desk. I should do the demo of: take a picture of the dollhouse chair, and then place the full-size one in my space.
Okay, I should ask you more about IKEA, but we're almost out of time and I wanted one more thing.
You started a Santa Cruz PyLadies meetup.
Yes.
Why?
So, Santa Cruz is like close to Silicon Valley, but not directly in it.
Close and yet so far.
Yeah.
And I wanted to meet more developers, more technical people, especially women.
I was like, they must be here in Santa Cruz somewhere, but I don't know where.
I don't know where they are.
I need this community around me.
So I started this PyLadies chapter in Santa Cruz to bring people together and it's worked out really
well so far. How much does it cost to be the person who organizes all this? I mean, is this
expensive? It is not terribly expensive. I work out of a co-working space called Next Space in Santa Cruz, and
they have rooms, conference rooms, and they allow me to host PyLadies for free because it brings
people from outside of Next Space into the space. So that would probably be the hugest cost
otherwise, just getting space. Although you could probably get companies to sponsor it as well.
And then on top of that,
there's like meetup fees for meetup.com.
But I think I can get a grant
from the Python Foundation to help pay for those.
And they're not that much.
It's like 40 or 80 bucks a year.
And then there's food and snacks,
but sort of been like figuring that out
over the last few months of how much food we need.
And people like yourself bring snacks as well.
So it's sort of community supported right now.
And one of the reasons that I wanted to have this meetup in the first place was I went to some of the other meetups. There's
some JavaScript meetup at a bar and there was a lot of dudes there. And I took my two-year-old
daughter with me. So there were two of us women, but it was like I had to bring my own extra female that I had made.
And so it is limited to women, or people who are...?
People who identify as ladies, as pie ladies.
I mean, it's open to anyone that would feel comfortable in that space. Although if you are a man,
we request that you come as a guest of another person in attendance.
And do you spend a lot of time organizing it?
I should probably spend a little bit more time
finding more people to give talks and stuff,
but not too much, no.
So it's not that big of a cost, it's not that big of an effort, but you do get a fair amount out of it. What do you get out of it? I mean, I didn't know there was a vi game, but...
Yeah. So, ten or so people show up to the meetings, and we have them every two weeks. And it alternates between whether it's a project night, where people come and work on projects together or we just talk about all kinds of things, or a speaker night, where someone presents. Just being in a space with other women and other tech people in the area, and seeing what other people are working on, and sharing ideas, and just getting excited about things. It just brings warm fuzzies to my heart.
I enjoy it, and I'm glad you started it, because it is hard to find a good technical community. And many of our meetups do tend to meet in bars, and I'm unlikely to go to a bar to meet people, just because it's not where I want to talk, because I can't hear anything.
Yeah, it's hard to get into the nitty-gritty technical details sometimes if it's dark and loud and you don't have a computer around. And you don't get to really know what other people are passionate about and what they're excited about, and how that can sort of rub off on you and get you really excited about something. But if you're in a sort of more collaborative space or environment... I'd love to have a longer PyLadies meetup sometime, like a little dev-house-style PyLadies Saturday morning.
Yeah, we could.
And I thought it was interesting that one of the presenters then went to go to a job interview and
was asked a question that was basically from her presentation.
Yeah.
And it was funny because, since it is every two weeks, it's only every four weeks that there's a presenter.
It's pretty easy to sign up on the presenter list, let me tell you.
But it is good practice.
Yeah. I mostly wanted to ask you about it
because I want to encourage people who have this idea
that it doesn't have to be a lot of effort,
and sometimes it doesn't work.
I mean, there's a decent chance that it may, in five years,
just be you and me looking at each other going,
well, maybe this has run its course
Yeah, which is also fine.
Yeah, but for those five years, or however long it exists, it can be all kinds of great opportunities.
I'm meeting new people, people who have sent me to other meetups, which were then way too crowded. But yeah, it's neat.
And there's two women there that run a Python study group in Felton. So they're on top of, like, we're just going to do this thing for ourselves.
Yeah. So if you're out there thinking, gosh, I wish there were other people that I could talk to, whether it's PyLadies or JavaDev...
The space is the hardest part, but if you can find a space,
even if it's a coffee shop that has a back room,
it might be worth it.
It might be worth it to try it.
And $40 or $80, yeah, that's a lot to try it,
but how much do you spend on conferences? This is like a year-long conference, one hour at a time.
And those fees are only for meetup.com.
Which is kind of the easiest way.
Yeah, it has made it very easy.
And people have found the PyLadies meetup through meetup.com.
But if that was a cost or something,
maybe there's more organic ways to advertise
and just get people
together that you want to share your technical interests with.
Yeah. I found a writing group on Nextdoor, of all places. So it's all kinds of stuff.
All right. We have kept you for quite a while, given... oh, we have so much more that we could talk about.
We do.
We totally do, which just means that you can come back.
And since you're local, come back.
Yeah, that'll be easy.
Chris, do you have any?
I was wondering if you had advice for people who want to get into this whole space,
either if they're in college or hobbyists or people who are professionals who want to change
to something. I mean, what's the right path to start learning about this whole space? Because it seems like a lot of different things.
Yeah, which part of the space? The computer vision part, the building-interactive-systems-that-people-can-play-with part, the game design part?
I guess the computer vision part, yeah.
Because it's a popular thing right now,
there are a lot of tools coming out,
including tools for making your own models and using them.
So I think TensorFlow is being ported to JavaScript, trying to make it as easy as possible for people that might be in a web programming language to get access to these tools, and then build things that are running in other people's browsers, so they're the easiest possible thing to share. I think, personally, going that route, where you are using JavaScript-type things where you can make something small and share it with your friends, and your friends will be like, wow, that's so cool... that will give you a ton of encouragement to keep going. And then I think with JavaScript you can look around and see how other people are doing this, because you can maybe get access to the code a little bit more easily.
So, doing it in a social kind of way.
Yeah. I mean, there's a lot of good social benefits to being able to share, especially if you're just getting started and trying to figure it out.
Cool. What about getting started in games with a purpose?
This morning there were tweets from this human computation conference called HCOMP,
which is happening in Zurich right now. And I think there's a keynote from the people doing Zooniverse,
which is a platform for all these different citizen science projects.
And some of them may not be game-flavored at all,
but they're probably game-flavored ones
or ones that could be more engaging
if they were more game-like
and helping ramp people up and learn things.
Zooniverse is the citizen science place
that does Galaxy Zoo,
where you can identify different galaxies or different features in pictures. And there are projects with cameras that are out in the wild, where animals will walk by and a motion sensor will trigger and the camera will take a picture, and then citizen science people have to go and tag those to say, there's actually an animal here: it's a fox, it's a bunny, it's a deer, it's an elephant. And so I think there's lots of these that are out there, ones that you can go find and participate in. And then Zooniverse has a platform for making more of those.
So if you have an interest in working on the building of those tools, the building of those projects, I'm sure there's space for that as well, whatever your passion is, or even getting involved with the existing ones.
Do you have any thoughts you'd like to leave us with?
Last brief thought on augmented reality.
Visual search is going to be a big part of that,
understanding what your environment has in it already
so you can do more meaningful, more intelligent augmented reality.
Our guest has been Kathleen Tuite, computer vision expert and software engineer at GrokStyle.
If you'd like to join us at PyLadies in Santa Cruz, there will be a link to the meetup in the show notes.
And if you're not local to Santa Cruz, there are lots of PyLadies and lots of meetups.
Check around. It's worth it.
Thank you for being with us, Kathleen. Thank you for having me.
Thank you to Vicky Tuite for introducing me to Kathleen and for producing her.
Thank you to Christopher for producing and co-hosting this show. And thank you for listening.
You can always contact us at show at embedded.fm or hit the contact link on embedded.fm.
Thank you to Exploding Lemur for his help
with questions this week. If you'd like to find out about guests and ask questions early,
support us on Patreon. Now a quote to leave you with from Douglas Engelbart.
In 20 or 30 years, you'll be able to hold in your hand as much computing knowledge as exists now in
the whole city or even the whole world.
I don't know when he said that, but I bet it's still true.
Embedded is an independently produced radio show that focuses on the many aspects of engineering.
It is a production of Logical Elegance, an embedded software consulting company in California.
If there are advertisements in the show, we did not put them there and do not receive money from
them. At this time, our sponsors are Logical Elegance and listeners like you.