The Science of Everything Podcast - Episode 113: Visual Processing

Starting point is 00:00:33 You're listening to The Science of Everything podcast episode 113, Visual Processing, and I'm your host, James Fodor. So this episode is a little bit different because I recorded this originally several years ago as part of the Vision series of episodes, which was episodes 45 through 47, so that's actually going back quite a ways now. And I kind of at the time archived this episode because I decided that we had sort of enough on vision for the time being, and for a few other reasons, I wanted to rework some of the content in this episode.

Starting point is 00:01:09 And it's been sitting in my folder for, well, several years now. And as I think I've mentioned, I've decided to put up an episode talking about the work that I've been doing as part of my thesis this year for my master's degree in neuroscience. And as I've just been recording and editing that, I realized that I have this episode, where I discuss many of the same issues, that's been sitting on my hard drive for years. And I figured, well, now is a good time to put that out. So just bear that in mind, because the recording's sound may be a little bit different, since this was recorded several years ago. In this episode, we're going to look at the processing of visual information in, well, the visual system, and how object recognition and feature extraction occur.

Starting point is 00:01:47 And so this episode follows on from, as I mentioned before, episodes 45 through 47 on vision, so I recommend listening to them first. Some of the episodes on the nervous system may also be useful, such as episode 38, Neurons and Synapses, so check that out as well. And without further ado, let's get started in talking about visual processing. The first thing to understand is the difficulty of the problem that the visual system faces here.

Starting point is 00:02:13 It's hard to gain an intuitive feeling for how difficult the problem of visual perception is, that is, going from the raw input that we have, say, onto the retina, or, you know, pixels on a screen or something like that, to distinguishing particular objects and being able to say this is there and this is in the background and this is a table,

Starting point is 00:02:30 this is a cat, identifying things and actually giving structure and meaning to the scene. That's a very, very difficult problem. So it's been discovered, through various psychophysical experiments, that humans can interpret most of the crucial core details, or, you know, the big picture of what's going on in a reasonably complex visual scene,

Starting point is 00:02:48 in about 150 milliseconds. So that's a small fraction of a second. Now, how is that interesting? Well, action potentials (and remember, the brain can only really communicate, ultimately, through action potentials) take a few milliseconds to generate. It kind of varies a bit, and there are refractory

Starting point is 00:03:02 periods where neurons can't fire and so on, but it's at least a few milliseconds, so on the order of a few milliseconds. So this means that if you basically divide 150 milliseconds, which is the total processing time, the minimum processing time, by a few milliseconds per action potential, then whatever processing is done by the brain must occur in at most a few dozen steps, and often much less than that, because humans can perceive visual stimuli and make sense of them in less than 150 milliseconds. That's for a fairly complex visual scene. Simple things can be discerned more quickly. There are complexities in this because there's a type of memory which is called sensory memory, which basically just maintains the immediately previous

Starting point is 00:03:39 visual input for like a couple of hundred milliseconds and gives you extra time to view it and therefore process it. So you have to have a way of masking this sensory memory, it's called backward masking, so that you're not sort of cheating in a sense by viewing the original sensory perception for longer than essentially you should be. All we want to know is, if we present it for a very short period of time, how long does it take to process it? We don't want to have that extra little bit of the sensory memory in there, giving you extra time in a sense. If you want to observe sensory memory, the best way I've found to do so is just to wait until night and turn off all the lights in your house. Or just make sure that it's very dark and make sure your eyes have

Starting point is 00:04:14 adjusted to the dark and then quickly flip on the light and flip it off again. And what you'll notice is that you can observe the room for like a fraction of a second after you flip off the light, or you could even just do it so that you have the one light on and you're looking around the room, and then you quickly flip it off, and you'll notice that for a fraction of a second after you flipped it off, you can still observe the room. Do this a few times.

Starting point is 00:04:35 You'll see that there is a very short period of a buffer there where basically you're not actually receiving any input to the retina, but your visual perception memory is maintaining that prior perception so that you have extra time to process it. So anyway, the point is that if we want to know how long processing itself takes, we've got to find a way of getting rid of that. So, how does all this happen in the brain? How does the brain make all these complex inferences based on the very minimal input that it gets from the retina?
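
As a quick aside, the timing argument from a moment ago is just a division, and it can be checked in a few lines. This is only a back-of-the-envelope sketch: the 150 milliseconds comes from the episode, while the specific values tried for "a few milliseconds" (3, 5 and 10) and the function name `max_serial_steps` are assumptions made for illustration.

```python
# Back-of-the-envelope version of the timing argument.
# 150 ms is the figure quoted in the episode; the per-step times are assumed.
total_time_ms = 150                  # time to grasp the gist of a complex scene
ms_per_step_options = [3, 5, 10]     # "a few milliseconds" per action potential


def max_serial_steps(total_ms, ms_per_step):
    """Upper bound on the number of one-after-another spike steps that fit."""
    return total_ms // ms_per_step


for ms in ms_per_step_options:
    steps = max_serial_steps(total_time_ms, ms)
    print(f"{ms} ms per step -> at most {steps} serial steps")
```

Whichever value you pick, the answer lands somewhere between about fifteen and fifty steps, which is where the "few dozen" comes from.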

Starting point is 00:05:01 Well, we don't really know, but we have a few ideas. So I'll talk about these, basically, in increasing order of complexity. So we'll start at the very low level, which is edge detection. How does the brain know where one object ends and the other object begins? How does it detect the edges of things? Well, we have many, many algorithms that have been developed for machine vision that are able to detect edges, and they can do a reasonably good job at it. Obviously, that's only a very low-level task, but it's still not trivial.

Starting point is 00:05:26 I think the simplest one that I'm going to talk about, and that gives you sort of the basic idea of how it works, is the case of just detecting a single line in a particular orientation, not shaded areas, just how would we detect one, basically black, straight line? So imagine that we have two neurons next to each other. These could be ganglion cells, or, I guess, bipolar cells would actually make most sense considering what we're talking about. So remember, bipolar cells have these

Starting point is 00:05:50 concentric circles of visual fields, the inner region and the outer donut region. Imagine we had an on-center and an off-center bipolar cell next to each other. So two of these cells next to each other, and the receptive fields of these two neurons would overlap. So the central areas, the on and the off, would be separate from each other. But remember the donuts, which you can imagine overlapping. So the left side of the donut for one overlaps the right side of the donut for the other. So there's a central region between the two bipolar cells where the outsides of their receptive fields overlap, if that makes some sense. Now suppose that the line we're interested in detecting

Starting point is 00:06:24 runs right through that overlapping region. If that's the case, then we could set up a network, a neural network, basically, that would only fire if that vertical line ran straight through the middle of these two bipolar cells. How would that work? Well, remember, we've got one on-center bipolar cell and one off-center bipolar cell,

Starting point is 00:06:43 so the two peripheral visual fields of these bipolar cells are overlapping. So that means when the line passes through these peripheral visual fields, one neuron is going to experience hyperpolarization and the other one's going to experience depolarization, because the on-center neuron will have an off-periphery, so its periphery will respond to no light. The other neuron, the off-center neuron, will have a peripheral receptive field that responds when light falls on it. So in other words, because the neurons are of opposite types, the peripheral visual fields of each neuron

Starting point is 00:07:15 will also be opposite. So one of the peripheral fields will respond to light, the other will respond to absence of light. And so if light falls on both of these at once, because remember we're saying these peripheral regions, the peripheral donut sections of the two bipolar cells, overlap, they're in the same position, then a single line falling on, or incident upon, the visual field in that overlapping region would indicate that there was a boundary, basically, that there was a region that had light and then a region that didn't have light. And then suppose we could connect a bunch of these different neurons together, or, say, a bunch of these different ganglion cells together, to form a line. Because remember, we've got retinotopic mapping,

Starting point is 00:07:53 or spatiotopic mapping, so that ganglion cells that are next to each other respond to input from regions of the visual field that are also next to each other. So if we draw a line across these ganglion cells and say, we need all of these four here to respond in order to get an output from this other neuron that they all synapse with, then if all of those four ganglion cells are active, what does that tell us? Well, it tells us that there is a line of that particular orientation. We could also have inhibitory cells on either side, so it would only respond, say, to a line

Starting point is 00:08:22 and not to a square, because, say, if you had one line of ganglion cells that deliver activations to a... Let's suppose that all of these ganglion cells synapse with a single cell in the lateral geniculate nucleus. All four in a row, they synapse with the same cell. And in order to get the lateral geniculate nucleus cell

Starting point is 00:08:38 to activate, you have to activate all four of these ganglion cells. That's just how the strengths of the synapses would work, basically. Activate four ganglion cells to get the one LGN cell to activate. If you activate all four of them, that would indicate that there's a line of a particular orientation, and so your LGN neuron that detected these lines, or maybe a V1 neuron that detected the lines, would activate. But now suppose you had a second line of ganglion cells. Imagine that it's parallel

Starting point is 00:09:02 to the first line of ganglion cells, but instead of being excitatory, these are inhibitory neurons. So that is, they send out hyperpolarizing signals to the LGN, to the same neuron in the LGN, the one that we talked about before. If only the first line of ganglion cells that we talked about went off, activated, then the neuron in the LGN would be activated and would set off an action potential. But if the inhibitory line of ganglion cells and the excitatory line of ganglion cells were both activated,

Starting point is 00:09:26 they would essentially cancel each other out, and so you would get no output. And why would you want that? Well, because basically, it might be a situation where you only want an output if you have a line of activation in this particular region, a narrow strip, basically, of light, not if you have a fat strip of light.
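
The wiring described so far can be sketched as a toy threshold unit. To be clear, this is a minimal illustration under stated assumptions and not a model of real retinal or LGN circuitry: the 4-by-5 grid of 0/1 "ganglion cell" activations, the function name `lgn_cell_fires`, the equal synaptic weights, and the choice to put the inhibitory cells on both sides of the excitatory column are all made up for the example.

```python
import numpy as np


def lgn_cell_fires(image, exc_col, inh_cols, threshold=4):
    """Toy threshold unit for the narrow-strip detector.

    image     : 2D array of 0/1 ganglion cell activations (rows x columns)
    exc_col   : column of ganglion cells with excitatory synapses
    inh_cols  : flanking columns with inhibitory synapses
    threshold : net drive needed to fire (4 = all four excitatory cells)
    """
    excitation = image[:, exc_col].sum()
    inhibition = image[:, inh_cols].sum()
    return bool(excitation - inhibition >= threshold)


# A 4 x 5 patch of retina: 1 = light falling on that cell, 0 = dark.
thin_strip = np.zeros((4, 5), dtype=int)
thin_strip[:, 2] = 1                 # light only on the excitatory column

fat_strip = np.zeros((4, 5), dtype=int)
fat_strip[:, 1:4] = 1                # light also covers the inhibitory columns

broken_strip = thin_strip.copy()
broken_strip[0, 2] = 0               # only three of the four cells are lit

for name, patch in [("thin", thin_strip), ("fat", fat_strip), ("broken", broken_strip)]:
    print(name, lgn_cell_fires(patch, exc_col=2, inh_cols=[1, 3]))
```

Only the thin strip drives the unit past threshold. The fat strip excites all four cells too, but the inhibition cancels it out, and the broken strip never reaches four active cells in the first place.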

Starting point is 00:09:40 The thin strip would only cover the excitatory strip of ganglion cells, and therefore, if you only had that, you would excite the LGN cell and off you go. But if you had a fatter strip of light that covered the excitatory and the inhibitory strip of ganglion cells, then you would get no excitation, and therefore the LGN cell would not fire. So if you connected the cells up in this way, you would get an LGN cell or maybe a V1 cell that only responded to a narrow strip of light, but not a fatter strip of light. And similarly, you could imagine just changing the orientation of the ganglion cells that we have

Starting point is 00:10:11 connected up to each other, well, connected up to the LGN cell or to the V1 cell, and that would allow you to detect, say, a vertical line or a horizontal line or a diagonal line and so on. So basically, just by connecting up the ganglion cells in the right way to the correct cells in the LGN and V1, it's relatively easy to get LGN cells and V1 cells that will respond only to lines of a particular orientation or to edges, basically, of objects. So the neuron will only respond when it sees a difference in the shading in two adjacent regions of the visual field, not the same shading. So this is quite interesting that it's actually relatively easy to get that, to use neural

Starting point is 00:10:49 networks to get that basic level of edge detection and of line detection. Moving on to higher levels of visual perception, though, things become a lot more tricky. In particular, depth perception is an especially hard nut to crack. Depth perception refers to the ability to take essentially two-dimensional input. So the input that we see on the retina is just two-dimensional. There's no depth to the retina. There's only one layer of cells. So we only get that two-dimensional input.

Starting point is 00:11:12 But from that two-dimensional input, we have to extrapolate, somehow use that to infer the three-dimensional structure of the actual scene that generated this input. It's the same thing when we see a picture. I mean, literally what we see is a two-dimensional image, but we perceive a three-dimensional scene behind it. How does that happen? Well, it's not very well understood,

Starting point is 00:11:31 in particular, because it's essentially mathematically impossible to go directly, or simply, from a two-dimensional image to the three-dimensional scene that produced it, because there are many potential mappings from the two-dimensional image to the three-dimensional scene. If you don't really know what I mean by that, then don't worry too much. But basically, there's not enough information. If you take all of the information from a three-dimensional scene and try and compress it onto a two-dimensional picture, you'll have to lose some information. And that means the two-dimensional image does not give you all the information

Starting point is 00:11:58 you need to extrapolate back to the three-dimensional scene. So in order to get from the two-dimensional to the three-dimensional, you have to make some assumptions. You have to assume things like, maybe, if an object blocks the view of another object, then that object must be in front of the other object. Now, that seems to be like a really obvious thing to say, but you can't necessarily assume that. There are optical illusions which are based on this, basically, where it looks like one thing is blocking another thing,

Starting point is 00:12:21 but in fact the shape's just been altered so that it looks like it's occluding the view. You know, the point is, if we see a person standing in front of a tree, then we perceive that as the person standing in front of the tree, the person is standing there and the tree's behind them in the actual three-dimensional structure. But it could well be that the tree just has an exactly person-shaped hole in it, and the person is standing behind the tree.

Starting point is 00:12:41 You know, that's possible, but our brain doesn't interpret it like that. Basically, it uses the heuristic of occlusion. If one object blocks the view of another object, then the first object has to be in front of the second object. It has to be closer to us than the second object. Again, seems obvious, but is not necessarily always the case. But the point is that this holds in the everyday world that humans interact in.

Starting point is 00:13:01 It's true enough of the time to be useful for our visual system. As for some of the other techniques that we think the visual system uses for depth perception, it's not 100% clear exactly what it uses, but one of the things we think that it uses is the orientation of lines. So if you have two diagonal lines pointing towards each other and sort of leaning in towards each other, humans naturally perceive that as the lines receding off into the distance. So this is the famous, like, you've almost certainly seen this illusion where you have two lines that are sort of diagonally sloping towards each other, and then between those two

Starting point is 00:13:32 diagonal lines, you've got two parallel horizontal lines. What it looks like is that the top line, the top of the horizontal lines, is much longer than the bottom line, because basically the top line is much closer to connecting with the two diagonal lines on either side. The bottom line has a much greater distance between each of its ends and the two diagonal lines. The reason for that, of course, is that the diagonal lines are moving closer together as they slope inwards, and so there's less space that the top horizontal line has to cover than the bottom horizontal line. But in fact, the two horizontal lines are exactly the same length. It's just an illusion. It looks like the top line is bigger because the way we perceive it, the way our brain

Starting point is 00:14:08 tends to interpret these things is that the diagonal lines aren't just two lines that are leaning towards each other. They actually are parallel. They're just moving away into the distance. Basically, imagine staring up train tracks, and that's what we're observing here. And so the inference from that would be that the top horizontal line would actually be further away from us than the bottom horizontal line, if this inference that the brain is making is true. And therefore, it looks the same length, but actually it should be longer because it's further away. But again, if you just see that visual illusion, it becomes obvious, basically, what I'm saying.
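
The inference behind that illusion can be written down as a one-line formula, if you are willing to assume a simple pinhole model of the eye, where the length of something in the image equals the focal length times its real length divided by its distance. That model, the function name `inferred_real_length`, and all of the numbers below are assumptions for illustration, not anything measured.

```python
def inferred_real_length(image_length, assumed_distance, focal_length=1.0):
    """Invert the pinhole relation
    image_length = focal_length * real_length / distance
    to get the real length implied by an assumed distance."""
    return image_length * assumed_distance / focal_length


# Both horizontal lines are the same length in the picture (arbitrary units).
image_length = 2.0

# The converging diagonals suggest the top line is further away (assumed distances).
bottom_line = inferred_real_length(image_length, assumed_distance=5.0)
top_line = inferred_real_length(image_length, assumed_distance=10.0)

print("bottom line, inferred real length:", bottom_line)
print("top line, inferred real length:   ", top_line)
```

With the same length in the image, the line that is assumed to be twice as far away comes out twice as long in the inferred scene, which is the direction of the illusion.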

Starting point is 00:14:36 But the point of that is that the brain is using the idea of the horizon and of how the three-dimensional world works in order to make sense of the two-dimensional image that it's observing. Yet another technique that the brain uses is stereopsis. This basically refers to the fact that the apparent position of an object will differ depending upon the vantage point from which it is observed. So the easiest way of understanding this is just to put your finger out in front of you, and close one eye and observe where it is, and then open that eye and close the other eye and observe the apparent change in the position of your finger.

Starting point is 00:15:09 It's easiest to observe when you put your finger closer to your face. The closer your finger is, the more it seems to change position when you switch from viewing it with one eye to viewing it with the other. The further away you move it, the less it seems to change in position. And if you view like the moon or something very far away, it won't appear to change in position much at all because it's so far away. So this principle of stereopsis, again, allows the brain to make inferences about how far away an object is based on how much the image of the object differs between one eye and the other.
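
To put rough numbers on the finger experiment, here is a small geometric sketch. It assumes the object is straight ahead, treats the eyes as two points about 6.5 centimetres apart (a typical ballpark figure, used here as an assumption), and computes the angle between the two lines of sight, which is how far the object appears to jump against a distant background when you switch eyes. The function name `binocular_shift_deg` and the distances are made up for the example.

```python
import math


def binocular_shift_deg(distance_m, eye_separation_m=0.065):
    """Angle (in degrees) between the two eyes' lines of sight to an object
    straight ahead at the given distance."""
    return math.degrees(2 * math.atan((eye_separation_m / 2) / distance_m))


# Distances in metres; all values are rough, illustrative assumptions.
examples = {
    "finger near the face": 0.2,
    "finger at arm's length": 0.6,
    "tree across the road": 100.0,
    "the Moon": 3.84e8,
}

for label, distance in examples.items():
    print(f"{label:24s} {binocular_shift_deg(distance):.8f} degrees")
```

The shift is many degrees for a finger near your face, a fraction of that at arm's length, and effectively zero for the Moon, so the size of the shift carries usable information about distance, at least for nearby objects.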

Starting point is 00:15:35 And this would explain why, if we remember back to the LGN and the visual tract and so on, that the information from each eye is kept quite distinctly separate. You know, it goes to different layers in the LGN and then different parts of V1. Well, one of the reasons for this is likely because the brain uses that, the brain needs to keep separate the information from each eye, so it can then use it later to determine the difference in the apparent position of objects from one eye to the other, and thereby infer how far away that object is likely to be.

Starting point is 00:16:02 So this information is necessary for stereopsis. A fourth technique that the brain likely uses for depth perception is parallax, which is a very similar idea to stereopsis, but it's not quite the same thing. So instead of the difference in the apparent position of an object from one eye to the other eye, that's stereopsis, parallax refers to the apparent motion of objects as you move. So the best way to understand this is just to, you know,

Starting point is 00:16:23 you look at the ground right next to you when you're driving in a car, and it appears to be moving very fast. You look at the trees on the side of the road, they appear to be moving, but not quite as fast as the ground. You look at the mountains in the distance, they don't appear to be moving at all. So basically, if you're moving, things that are closer to you appear to be moving faster than things that are further away from you. The brain uses this piece of information to basically infer,

Starting point is 00:16:41 well, the faster an object's moving, you know, other things being equal, that means it's probably closer to me than objects that are moving slower. They're probably further away. You know, and it combines this with stereopsis and occlusion and the various other rules of perception and so on, to determine how far away an object is. And this would explain, again, why the brain needs to keep separate information about identifying what an object is

Starting point is 00:17:00 and information about identifying how fast and where it's moving, because it needs to use that latter information for depth perception. So as I said before, it's not exactly clear which of these techniques the brain uses, or, I mean, it almost certainly uses all of them, but the extent to which it uses them, one versus the other, or how important they are, and exactly which parts of the brain

Starting point is 00:17:18 implement them are not entirely clear. Moving on from depth perception to talk a bit about motion, this is another interesting problem. How do you determine, based on a fixed visual image or a sequence of fixed visual images, that some part of visual image one corresponds to another part of visual image two? That is, something you see in visual image one is actually the same object that you see in a different place in visual image two. How do you know that that blob over there to the left in visual image one is actually the same thing that's creating a blob on the right-hand side of visual image two?

Starting point is 00:17:50 How do you know that it's not something different? How do you know it's the same thing that's moved? Again, this is something that the brain has to figure out. And it seems that the brain uses, again, various heuristics that work pretty well in our everyday world, but aren't necessarily always going to work in every situation. And again, this is why there are many visual illusions, which are relatively easy to demonstrate, where the brain makes these assumptions, but it's actually wrong.

Starting point is 00:18:11 So one such assumption or heuristic that's used for motion is basically that objects in the real world move continuously. An object can't jump from one place to another. It has to move through the intervening space. And so objects that jump will appear to us to be different objects, but if it moves continuously, it will appear to be the same object. Here's another interesting way that the brain can potentially detect the motion of a single object. So remember, region V5, or MT, same thing, contains direction-sensitive motion columns

Starting point is 00:18:37 that are sensitive to motion in particular directions, and remember V1, and some other areas as well, contain orientation-selective columns of neurons. Well, imagine if you combine these two together. Imagine if you had a cell that responded only when it observed, say, a horizontal line. So, say, it took an input from a neuron in V1, which responded solely to a horizontal line, and it also took input from a neuron in, say, area MT, which responded only to up and down

Starting point is 00:19:04 or vertical motion. And suppose this third hypothetical neuron that we're talking about only responded when it observed a fixed horizontal line and also observed motion, again in the same region of the visual field, going, say, up and down. If that was the case, then basically, what that neuron would be observing is an object, say a line or a square or something like that, that has a wide, straight base moving downwards or perhaps moving upwards, because it's seeing the width of the object, and that's the V1 orientation neuron

Starting point is 00:19:35 that's providing that information about the width of the object, and it's also seeing the direction of motion of the object, and the MT neuron is providing that in terms of the up and down motion. And the reason that those two vectors, basically, the direction that the line is pointing, from the orientation column, and the direction that the motion is going in, from the MT neuron, need to be perpendicular to each other is because generally, whenever you have an object moving,

Starting point is 00:19:57 you're going to have one of those perpendicular pairs. I mean, think about it: if a car moves, you can observe that the front of the car will sort of be vertical, and the car itself will be moving horizontally, and so you'll get a perpendicular pair of lines there. For pretty much anything that moves, from some vantage point, if you can see it moving, there'll be some perpendicular combination of lines,

Starting point is 00:20:16 so that the object is moving in one direction, and one side of the object has a line that is perpendicular to the direction of motion. So if you had neurons that were connected up in just this way so that they could detect the perpendicular orientation information and motion information, then that neuron would be able to detect the motion of an object. I don't know if any neurons have been found that are specifically connected up in that way, but it's one plausible mechanism that the brain could use in order to detect object motion. Moving on from object motion, we can talk about another problem, which is an even higher-level problem, the detection of object parts.

Starting point is 00:20:48 How does the brain work out what the different parts of an object are? Maybe the brain can observe a person, but how does it know that, say, a person is broken up into parts, you know, that they have an arm, which is part of the person, and a hand, which is part of the arm, and that a finger is not part of the body directly, or part of the arm directly? A finger is part of the hand, which in turn is part of the arm, which in turn is part of the body. How does it get that sort of hierarchical structure of parts, and how does it know which parts belong to which object and so on? How does it know that the finger is not just sort of floating out there,

Starting point is 00:21:17 or the finger's not part of the tree, but the rest of the hand is part of the body. This is very poorly understood. There are some ideas, though. There are many, many algorithms that have been developed that can, to varying degrees of accuracy, identify object parts. One technique, which I can't really go into, because it's a bit too complex to explain, and it's a bit beyond the level we're going for here.

Starting point is 00:21:34 But the basic idea is that you try and find regions of curvature or concavity relative to their surroundings, and these tend to mark out the joints between parts of objects. So, I mean, if you imagine the place where your arm joins your torso, basically the central region where that joint occurs, you can sort of draw a circle around there and it will look sort of curvy. Same thing with your leg joining your lower torso. Same thing with a branch sticking out of a trunk. There's sort of a circular curvature that will be associated with the particular parts of the objects that connect to each other.
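
One very stripped-down way to see the idea is to treat an object's silhouette as a polygon and look for the corners where the outline turns inwards. This is only a sketch of the general concavity rule under stated assumptions, not any particular published algorithm: the function name `concave_vertices`, the blocky "torso with an arm" outline, and the requirement that the corners are listed counter-clockwise are all choices made for the illustration.

```python
def concave_vertices(polygon):
    """Return the corners where a counter-clockwise outline turns inwards.

    polygon : list of (x, y) corners, listed counter-clockwise.
    A negative cross product of the incoming and outgoing edges means the
    outline makes a right turn there, i.e. the corner is concave.
    """
    n = len(polygon)
    concave = []
    for i in range(n):
        px, py = polygon[i - 1]            # previous corner (wraps around)
        cx, cy = polygon[i]                # current corner
        nx, ny = polygon[(i + 1) % n]      # next corner (wraps around)
        cross = (cx - px) * (ny - cy) - (cy - py) * (nx - cx)
        if cross < 0:
            concave.append(polygon[i])
    return concave


# A blocky silhouette: a tall "torso" with an "arm" sticking out to the right.
torso_with_arm = [
    (0, 0), (2, 0),      # bottom of the torso
    (2, 2), (5, 2),      # underside of the arm
    (5, 3), (2, 3),      # top of the arm
    (2, 5), (0, 5),      # top of the torso
]

print(concave_vertices(torso_with_arm))
```

The only inward-turning corners are the two points where the arm meets the torso, so cutting the outline between them would split the shape into its two parts.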

Starting point is 00:22:02 It's very hard to explain this. But basically, it's just a property of geometric shapes that when you connect them together, you can measure like curvatures or concavities of the shapes that are specific to the regions where different parts join. And it's not perfect, but it's just a general rule that would enable you to detect those parts. And so if the brain was able to detect those regions of high curvature or concavity, then it would be able to break the object up into parts. Another example would just be in terms of the concavity.

Starting point is 00:22:25 Imagine if you had something like a wine glass, so a wine glass starts out wide at the bottom, then tapers inwards, and then splays out again. If the brain had some kind of algorithm that allowed it to detect regions of concavity, that is, where the lines are pinching inwards most strongly, and it applied that to the wine glass, then it would naturally divide the wine glass up into the base and then the rest of it, or possibly the base, the sort of stalk, and then the main part of the glass.

Starting point is 00:22:51 Maybe it would do something similar to a flower, for example, divide it up into stalk and petals and the central region if it used this sort of concavity rule. So that's not perfect, but it's just an idea of how the brain could use general geometric properties of objects in the world to try and infer where the different parts of the objects are. There are also many theories about how the brain might detect shapes and distinguish between shapes. One interesting theory is that of generalized cylinders. This is the idea that many complex shapes in the real world

Starting point is 00:23:16 can actually be represented by cylinders with just a few of the properties of the cylinder tweaked. So, for example, instead of the cylinder just going straight up, you might be able to curve the direction, basically curve the vertical axis of the cylinder, so that it sort of bends down to one side or the other side. You might also be able to change the radius of the cylinder

Starting point is 00:23:34 so that it maybe tapers up into a cone or splays out fat or something like that. It turns out that you only need a very small number of variables or parameters that you can change, of a generalized cylinder, to actually produce a very large number of complex shapes. So the idea is that the brain doesn't need to encode a very large number of potential different shapes. It just needs to encode the basic parameters of a generalized cylinder and then be able to alter those depending on the particular structure of the shape that it's viewing.
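
To make the "few parameters, many shapes" point concrete, here is a minimal sketch of a generalized cylinder as a stack of circular slices. The parametrisation is an assumption chosen for simplicity and is not taken from any specific theory: the function name `generalized_cylinder`, the radius profile `radius_fn`, and the single `bend` parameter (which just slides each slice sideways rather than tilting it) are all hypothetical.

```python
import math


def generalized_cylinder(height, radius_fn, bend=0.0, n_slices=5, n_around=8):
    """Return (x, y, z) surface points of a cylinder-like shape.

    height    : total height of the shape
    radius_fn : function giving the radius at fractional height t in [0, 1]
    bend      : sideways drift of the axis (x offset = bend * t**2)
    """
    points = []
    for i in range(n_slices):
        t = i / (n_slices - 1)             # 0 at the bottom, 1 at the top
        z = t * height
        axis_x = bend * t ** 2             # curved axis; slices stay horizontal
        radius = radius_fn(t)
        for j in range(n_around):
            angle = 2 * math.pi * j / n_around
            points.append((axis_x + radius * math.cos(angle),
                           radius * math.sin(angle),
                           z))
    return points


# Three different shapes from the same template, changing only the parameters.
straight = generalized_cylinder(2.0, lambda t: 1.0)            # plain cylinder
cone = generalized_cylinder(2.0, lambda t: 1.0 - t)            # tapers to a point
bent = generalized_cylinder(2.0, lambda t: 0.5, bend=1.0)      # axis curves sideways

print(len(straight), "points per shape")
print("cone tip:", cone[-1])
```

A plain cylinder, a cone and a bent tube all come out of the same few lines, differing only in the radius profile and the bend value, which is the economy the theory is pointing at.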

Starting point is 00:23:58 And it sort of fits each complex real-world shape onto sort of abstract or simplified templates, basically, that it has of different basic shapes. And there's some support for this, in that humans, when you ask them to recall visual information, tend to produce sort of stereotyped or simplified versions of it. They don't remember all the precise details of a visual scene or of the shape of an object. They tend to simplify things and, you know, cut off corners and round edges and stuff like that. So there's some support for this idea that we retain the basic outline, the sort of simplified shape of things, rather than the gritty details.

Starting point is 00:24:30 And finally, the really knotty problem of visual perception, which is what we started talking about, object recognition. This is basically distinguishing between a cat and a dog and a chair. How do you actually tell what the object is? This is basically as high as it gets in terms of levels of processing. You know, starting from just literally like individual pixels of light to recognizing lines and edges, to recognizing different shapes and motion and depth perception,

Starting point is 00:24:53 to recognizing object parts and putting those parts together, to finally recognizing the object itself and working out what you're actually looking at. There's sort of two broad schools of thought on object recognition. One is basically the bottom-up approach, and one is the top-down approach, and then, of course, there's a combination of those schools, you know, that it's both. Definitely there are elements of bottom-up and top-down processing in the brain, but let me just explain what I mean by that. Bottom-up processing would basically hold that the brain processes visual information

Starting point is 00:25:18 by starting with very simple information, so, you know, straight from the retina or the ganglion cells, all that it tells you really is there's light here, there's light here, there's no light here, there's more light here, and so on. So just basically pixels of light across the visual field. Then as you moved into, say, V1, you would build up from that very low-level analysis to, well, there's a line here of this orientation, there's a line here of this orientation, and so on. And then maybe as you moved up to V2 and V3, you would see, well, there's a circle here and a square here, and they're relatively close to each other, and then maybe as you moved up to areas like

Starting point is 00:25:48 MT and V4 and so on, you would say, well, there's a yellow pyramid here which is moving in this direction, and then there's this sort of cylinder shape here with this curve on it, which is moving over there, and so you get higher and higher, more abstract levels of shapes by building up from the lower levels, and then finally you get to areas like IT, the inferior temporal cortex, or other regions like that, and by the time you get there, you're actually recognising shapes and objects in particular. You know, that's a dog, that's a cat. And so the idea of bottom-up processing is you build up, progressively getting more and more complex from lower levels until you get to the very highest levels of complexity that can recognize complex scenes and objects and so on. Now, one prediction of that,

Starting point is 00:26:24 or at least potential prediction would be the idea that, well, this process of building complexity would culminate in hypothetical cells, which would specifically fire only to a very narrow range of concepts or of objects. So this idea is referred to as the grandmother neuron hypothesis. The idea being that, well, somewhere in your brain there would be a neuron or maybe just a small group of neurons, a handful of neurons, that would only respond to your grandmother, because it's specifically attuned to just that pattern of inputs that corresponds to your grandmother. Now, this hypothesis is still controversial. There's some evidence that, well, I don't think there's really any good evidence for grandmother neurons per se. I don't know if anyone really believes,

Starting point is 00:27:05 well, I think some neuroscientists do believe that there are grandmother neurons, but generally the idea that it gets that specific is, I think, a minority position. There certainly is evidence for neurons that process specific types of objects, like the fusiform face area neurons, for example. So there do seem to be neurons that only respond to faces. But neurons that respond to just one particular face, that's a bit harder to determine. And also, again, you face the same problem that I mentioned before, of how do you know it only responds to your grandmother?

Starting point is 00:27:33 Have you tried all the other potential faces that it could respond to? Maybe it just responds to gray hair, or maybe it just responds to older women, or maybe it just responds to glasses. Like, there's so many different aspects of the image that it might respond to. You can never really know if it only responds to your grandmother. But I think there are various reasons why the grandmother neuron hypothesis in its sort of pure form is implausible. Well, one reason is we've never really found any neurons that are that specific. The closest is the fusiform face area, but even there, it's not that specific as far as we can tell. Another reason is that it's hard to see how we have enough such

Starting point is 00:28:04 cells in the brain. Now, there are certainly a lot of neurons in the brain, about a hundred billion, but think about the number of different objects or visual scenes we could possibly see. And of course, all of those hundred billion aren't devoted to visual processing. I don't know what the percentage is, but it's got to be less than 10% or something like that. So, you know, a few billion neurons, and there's definitely more than a few billion potential scenes or objects that you can see or even have seen in your entire life. Now, these might not all be coded individually. There might be some overlap and whatever. But the point is, at a very simplistic level of analysis, it seems implausible that you just have one neuron

Starting point is 00:28:34 for every specific individual thing you've seen. Another reason, I think probably the biggest argument against the grandmother neuron hypothesis, is that the brain has been very consistently demonstrated to exhibit what's called graceful degradation or gradual degradation. That is, if you damage parts of the brain, you might get unlucky and happen to hit some very crucial area, which would make you go unconscious or something like that. But generally, you might not even notice anything. In fact, if you've heard of the story of Phineas Gage, he had an iron rod several centimetres

Starting point is 00:29:02 thick pass right through his prefrontal cortex and survived, and wasn't dramatically harmed. He had a bit of a personality change afterwards, but he certainly survived, and he certainly still functioned more or less okay, and there have been many other cases where substantial brain damage has not been associated with dramatic loss of function. You know, people can have strokes and recover from them to a substantial extent. The point is the brain degrades gracefully. When you damage it, it loses some of its function. Maybe you damage it a bit, and it goes down to 90% functionality. You damage it a bit more. It goes down to 80% of its functionality. It gradually gets worse and worse.

Starting point is 00:29:33 It doesn't catastrophically just completely stop working if you damage it just a little bit, which is what happens to, say, computers, generally speaking. The brain is not like that at all. So this seems to be inconsistent with the grandmother neuron hypothesis, because that hypothesis basically says that if that one neuron died or that small group of neurons died, you wouldn't be able to recognize your grandmother anymore. And no one's been observed, as far as I'm aware, who has a deficit in recognizing just one type of object.
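
To make that contrast concrete, here is a toy Python sketch. It is not a model of real neurons. It compares a "one unit per object" code with a code where each object is a pattern spread over all the units, then silences some units and checks which objects can still be told apart. All of the names (`make_codes`, `recognise`, `accuracy_after_lesion`) and the sizes used are assumptions chosen purely for illustration.

```python
import random

def make_codes(n_objects, n_units, seed=0):
    """Build two toy codes for the same set of objects."""
    rng = random.Random(seed)
    # "Grandmother cell" style code: object k is signalled by unit k alone.
    localist = [[1 if u == k else 0 for u in range(n_units)]
                for k in range(n_objects)]
    # Spread-out code: each object is a random +1/-1 pattern over all units.
    distributed = [[rng.choice((-1, 1)) for _ in range(n_units)]
                   for _ in range(n_objects)]
    return localist, distributed

def recognise(activity, codes):
    """Index of the stored code that best matches the activity, or None."""
    scores = [sum(a * c for a, c in zip(activity, code)) for code in codes]
    best = max(scores)
    return scores.index(best) if best > 0 else None

def accuracy_after_lesion(codes, dead_units):
    """Fraction of objects still recognised once dead units are silenced."""
    correct = 0
    for k, code in enumerate(codes):
        activity = [0 if u in dead_units else v for u, v in enumerate(code)]
        if recognise(activity, codes) == k:
            correct += 1
    return correct / len(codes)

localist, distributed = make_codes(n_objects=10, n_units=100)
print(accuracy_after_lesion(localist, {3}))     # 0.9: object 3 is lost outright
print(accuracy_after_lesion(distributed, {3}))  # each pattern only loses 1 of 100 units
```

In the one-unit-per-object code, silencing unit 3 wipes out object 3 completely and leaves everything else untouched. In the spread-out code, the same damage removes only a small part of every pattern, so each object should still be matched to its own stored pattern. That is the kind of behaviour graceful degradation describes.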

Starting point is 00:29:56 There have been deficits observed where people will have trouble recognizing just faces or trouble recognizing objects in general, but they can use the object and so on. We talked about those earlier, but none that I've heard of where a person just can't recognize just tools or just like a hammer or something like that. They can recognize everything else but not a hammer.

Starting point is 00:30:10 I don't think that's been observed, and it would be inconsistent with the general principle of graceful degradation that we see in the brain generally across a wide variety of different types of functions. So I don't think the grandmother neuron hypothesis is particularly plausible, but it is possible, and maybe there's a degree to which that's true.

Starting point is 00:30:27 So what I've been talking about so far, that's bottom-up processing. The top-down processing model is essentially saying the opposite. It's saying that to a large extent our visual perceptions are shaped by expectations and context and memories and things like that. So essentially we see what we expect to see or what we remember seeing. That would be possible if, for example, we had a lot of back projections of neurons running from V1 back to the LGN, and from V4 back to V1 and back to the LGN, so if the neural inputs ran in the opposite direction instead of just forwards. And indeed, as we

Starting point is 00:30:55 saw before, this is precisely what we observe. We have a lot of neural inputs running backwards, which would be consistent with basically information not just going forwards from the bottom up, but also going backwards from the top down. And indeed, in psychological experiments, we certainly do observe a great number of effects whereby people's moods and their attitudes and their memories and their expectations and things like that strongly shape what they perceive when you tell them, you know, look at this image or whatever. One example that I recently heard about was that a bunch of doctors were asked to view these,

Starting point is 00:31:25 I can't remember precisely what imaging technique was used. Scans of the lungs, though, maybe they were X-rays or something else, but scans of the lungs, to observe and see if they could find any tumours there or anything else. And so they were scanning the lungs for tumours. They were only asked to focus on those things. And obviously, they've been trained for many years, this is their job, and so they do that. And then after they scanned the pictures, they were asked if they'd seen anything unusual or strange. And most of them didn't, I can't remember how many, it was half of them or 30% of them

Starting point is 00:31:53 whatever, most of them didn't see anything. And it turns out that there was a fairly large, like not huge, but fairly large picture of some animal on there. I can't remember, was it a bear or something? The researchers had just stuck a random picture of some animal on there. It was really quite obvious if you just looked at it. But the point is, the doctors weren't looking for animal faces in this picture of a lung. The context would have meant that that would just be absurd to find. They were so focused on finding tumours or other potential lesions or other problems

Starting point is 00:32:19 that they weren't looking for animal faces. And so this is a strong piece of evidence that what we observe in terms of literal visual perception is strongly shaped by what we're expecting to see. So that's a prediction of the top-down processing model. Interestingly enough, they also did eye-tracking, so it's possible to observe what part of the visual scene one is observing, and basically that's where you're directing your gaze, so what part of the visual scene is falling on your fovea.

Starting point is 00:32:41 And it turns out that most of the doctors, in fact, I think pretty much all of them, looked directly at the animal face at least once or twice, including the doctors who didn't see it. So they looked right at it, with the light falling on their fovea, which is the region of the retina that has the highest level of resolution and so on, and they still didn't see it.

Starting point is 00:32:59 So what's going on there? The information obviously was processed by their brain, but it somehow didn't get to conscious awareness. So I think this is very strong evidence for a top-down role in processing, where the information likely got right through V1, maybe even to higher cortical levels, but it didn't get to that region of conscious awareness where they actually perceived it. So anyway, overall, I think there's clear evidence for both bottom-up and top-down processing in vision, including in object recognition. The hierarchical sort of structure of information flow through the visual system, you know, from the retina to the LGN to V1, V2, V3, V4 and so on is, I think, clear

Starting point is 00:33:34 evidence of bottom-up processing. However, the many back projections that exist in these areas as well are evidence for the top-down model, and the various psychological experiments that I talked about are also evidence for the top-down model. Generally, though, I'm particularly partial to distributed representation theory, which basically says that objects are not coded for by particular specific cells. This is basically in opposition to the grandmother cell theory. In the grandmother cell hypothesis, you've got one neuron for one particular object or type of object. Distributed representation theory says that what the brain really stores is patterns of activation. So you've got a pattern of activation that corresponds to motion in this direction or

Starting point is 00:34:08 this particular texture or this particular shape and so on. So basically you can grab all these different potential types of activation and combine them into various different patterns. And these different patterns will correspond to different objects. So, for example, if I have a banana, I'll have a pattern of activation of neurons, maybe from V4, for example, which corresponds to the yellow color. I'll have another one which corresponds to the bent shape of the banana. I'll have another one that corresponds to the size of the banana. Maybe I'll have some others that relate to the memory of what it's like to eat a banana

Starting point is 00:34:36 and other things like that. And then all of these patterns of activation and/or memories coming from various parts of the cortex, including perhaps V1 or V2 or wherever, you know, the direction of the edges and the orientation of lines of the banana, all of that feeds up through to whatever higher cortical region is responsible for saying, ah, that's a banana, that pattern of activation corresponds to a banana. If I change the

Starting point is 00:34:57 pattern of activation and instead change the color and change the texture and change the line orientations and so on, maybe that new pattern of activation corresponds to an orange or a car or something completely different. So hopefully you can see the difference here between distributed representation theory where it's patterns of activation comprised of lower elements and grandmother neuron theory

Starting point is 00:35:13 where you have a single neuron that just responds and tells you that that's what the thing is. One big problem, however, with distributed representation theory is called the binding problem. Now, the binding problem refers to a number of problems in neuroscience and cognitive science generally, but I'll just talk about this particular application of it. Okay, so if we think that your perception of a banana is really just built up out of lower-level bundles of attributes like its size and its shape and color and places you associate with using bananas and so on and all that

Starting point is 00:35:42 sort of stuff, if it's just built out of those lower things jumbled together, then how does the brain actually combine those things and consider them to be part of a whole? In other words, how does it take the color and the texture and shape and put those all together into a single percept so that we see a banana instead of just yellow, bent, tasty, whatever, or we see a tree instead of just a bunch of shapes and colors and textures and so on. How do we actually put it all together? How are the different percepts bound together? Obviously, it would have to be very selective, because it wouldn't be any good to bind everything together, because then we'd just observe the entire world as a single undifferentiated object. We don't want to do that. We want to observe the world, and we do

Starting point is 00:36:17 observe the world as distinct objects, you know, here's one thing, here's another thing, here's something else. Somehow the brain's got to bind together a bunch of characteristics that correspond to this object and then a bunch of different things that correspond to this other object over here, and they have to be kept separate from each other but sort of combined together. There's no real agreed-upon answer to how that occurs. That's very high-level stuff we're talking about here, and it's very hard to study. About the best explanation that anyone's come up with is temporal synchrony of firing of particular patterns. So in other words, the brain knows to put the color and shape and so on of a banana together, because all of those

Starting point is 00:36:48 patterns of activation fire, or are active, the neurons are active, at the same time. There are various problems with that. I mean, obviously, if you're doing two objects at once, how do you distinguish one from the other? I suppose you could argue something like, well, you can only basically view one object with the fovea at once, so maybe the brain just looks at any patterns of activation that occur from input that comes from the fovea, and that is bound into one object. That's a possibility, but there are other problems with that as well. Another possibility that has been suggested as a mechanism is common location within the visual field. So maybe there's some sort of meta-edge detection that the brain uses to detect single objects,

Starting point is 00:37:21 and then everything inside those broad edges is classed together as a single object, and the patterns there are bound together. This might explain things like why, you know, what counts as a distinct object depends upon how far away you are from something. So a car looks like a single discrete object with all of its properties sort of bound together. You perceive it as being a collection of its shapes and parts and colors and textures and so on rather than a bundle of those separated from each other. But if you view a car very close up, then it actually looks like it's made up of distinct parts,

Starting point is 00:37:50 which may not necessarily be closely linked to each other. It's made up of the wheel and the tires and the windscreen wipers and so on. So that would be consistent with common location in the visual field, because when objects are far away, basically all of those perceptions come from the same point in the visual field, whereas when you go close up, it spreads out over the visual field, and so they no longer seem to be the same thing. I don't know if temporal synchrony can explain that problem in the same way,

Starting point is 00:38:11 but temporal synchrony and common location within the visual field are two competing hypotheses for how the binding problem is solved. We've finally come to the end of our journey of beginning at the eye and travelling right through the retina, the LGN, the primary visual cortex, higher levels of processing, and then talking about some of the more abstract computational levels of processing that the brain undergoes in order to make sense of the visual scene. If you enjoyed this episode, please jump onto iTunes and give me a favourable review. More reviews attract more people to the podcast by basically getting me higher ratings.

Starting point is 00:38:41 You can also visit the podcast website at FODs12.podbean.com, where we have an archive of all prior episodes. Also, you can jump onto Facebook and search for The Science of Everything podcast and give us a like. Again, liking the podcast page helps to spread the word. And on the Facebook page, you can also look at some visual aids and images that I posted up from some past episodes that will hopefully aid your understanding. Also, I'd love to hear from you. Send me an email at FODs12, that's F-O-D-S-1-2, at gmail.com with your questions, feedback, criticism, whatever. Love to hear from you.

Starting point is 00:39:12 Thanks for listening, and I'll talk to you next time.
