This is the last lecture on classic computer vision methods. Let's talk about object recognition. Let's start with the basics. What is object recognition? What are we trying to do? Object recognition is a very important task in computer vision research. We want to be able to identify objects in the image, and then we want to use this information to drive decisions, for example in self-driving cars, in robotics, and in other automated machinery. Pretty much every area of computer vision benefits from some object recognition result. But how do we define it? How do we formalize it? In what format do we want to output the results of object recognition, and how should we recognize objects? What exactly are we recognizing?

For example, we could ask general yes or no questions. We can ask things like: is there a car in the image? Is there a person in the image? Is there a bicycle in the image? Those are simple yes or no questions. Is there a person in the image? Sure, there are some people in the image, we can see them. That's one way we can think about it.

How else can we do it? We can say we want to detect people. We don't just want to say there are some people in the image, we want to actually do detection. What is different about detection from asking whether there is a person in the image? Detection means we have to produce a position, a localization, for our object. We're going to put what we call a bounding box around the object: we detected one person here, we detected another person here. That is detection.

Another thing we can do is look at the image as a scene. Can we classify the scene? For example, is this indoors or outdoors? This is a very good example for that: since this image appears to be taken inside a gym, it's definitely an indoor scene. That's another way we can look at recognition. Not necessarily objects, but based on the objects we see in the scene, can we make inferences about the whole scene, like whether it is indoors or outdoors?

What else can we do with object recognition? We can look at object attributes. What do I mean by that? We can say we have a person here, and this person is seen from the back, a back view. By comparison, we have another person here, and this would be a profile view or side view. We can say a little bit more about objects. We can pick another thing here and say: we have a ball, the ball is red, and the material appears to be something like rubber. We can detect more than just the object; we can try to detect something about its pose, something about its texture or material.

What else can we detect in images together with objects? We can detect activities. For example, we can say we have people here who are standing, but we have another person here, and this person is climbing. Especially for people, we can have activities: standing, running, walking, climbing, maybe playing soccer. All sorts of things that we can hopefully infer from the image.

Now, all of these things are super cool. If we could say so much about an image, that would be absolutely great. That will be our goal.
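To make the difference between those output formats concrete, here is a minimal sketch, purely illustrative and not tied to any particular library, of an image-level yes/no answer versus a detection that also carries a localization. The class names, field names, and pixel coordinates are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClassificationAnswer:
    """Answer to an image-level yes/no question such as 'is there a person?'."""
    label: str      # category we asked about, e.g. "person"
    present: bool   # yes or no for the whole image
    score: float    # confidence in [0, 1]

@dataclass
class Detection:
    """A detection also localizes the object with a bounding box."""
    label: str                          # e.g. "person"
    score: float                        # confidence in [0, 1]
    box: Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max) in pixels

# Classification answers the question for the whole image ...
answer = ClassificationAnswer(label="person", present=True, score=0.93)

# ... while detection returns one bounding box per person found.
detections: List[Detection] = [
    Detection(label="person", score=0.91, box=(42, 30, 118, 260)),
    Detection(label="person", score=0.87, box=(300, 55, 380, 270)),
]
```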
But there are definitely many challenges with respect to detecting objects and classifying images, to finding the different classes of objects in images. The one that probably gives people the most trouble is the camera viewpoint. Depending on the viewpoint, the direction from which we see our object in the image, the object is going to look very different, because from different viewpoints we see different parts of it. If we have, for example, a profile view like here and like here, we can only see one of the eyes and one of the ears; the nose looks different, the mouth is a little bit different. Then again, we have a frontal picture here, but the dog is lying down, so we can see the head and the front two paws, but we can't see much of the back. This is actually very important, and it's not just for animals. Think about cars, bicycles, buildings: for every category, the viewpoint determines which elements of the object are visible, not everything, so variations in viewpoint are going to be very challenging.

As always in computer vision, illumination is another big deal. In classic computer vision we always ask: is this algorithm resilient, invariant, to rotation? You can think about viewpoint as a rotation of our object. Is it invariant to illumination? Is it invariant to scale? Those three things give computer vision scientists great trouble. Let's look at why illumination is such trouble. On one hand, you can have a situation like here where, depending on where the light source is, in this particular case the sun because it's an outdoor image, parts of the object are going to be lit more than others. The area here is more in the sun, the area here is more in the shade. Even when we think about face detection, light poses a really big challenge because, just like in this example, one side of the image could be brighter in intensity than the other, so we won't necessarily get the perfect symmetry we might expect. Another thing that could happen: these are basically shaded areas caused by the light source, but we can also have shadows cast by other objects. Same here, shadows from other objects. Sometimes those shadows completely change the intensity over an area of pixels, like here, and end up obscuring what's happening underneath. We then have to figure out what exactly is happening with this part of the animal. It's almost like having an occlusion, which is another challenge we're going to talk about in a sec. So, variations in illumination.

The third one was variations in scale. We're going to be looking at images where the object we are searching for could be as big as the image itself or very, very small. Here there are differences for the same object, or the same class, you can think of it as "dog", in both viewpoint, profile versus back, and scale, large versus small. It is very hard to have to search for an object at all the possible scales, in all the possible positions, rotations, and viewpoints. Those are the biggest challenges with respect to devising an algorithm.
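To see why searching over positions and scales is so costly, here is a small sketch of the brute-force strategy classic detectors lean on: slide a fixed-size window over every position of the image, at several scales of an image pyramid. Everything here is illustrative; `classify_window` is a hypothetical stand-in for whatever classifier decides if a window contains the object, and the window size, stride, and scale factor are arbitrary choices for the example.

```python
# A brute-force, multi-scale sliding-window scan (illustrative only).
import numpy as np

def image_pyramid(image, scale=0.75, min_size=32):
    """Yield the image repeatedly downscaled until it gets smaller than min_size."""
    while min(image.shape[:2]) >= min_size:
        yield image
        new_h = int(image.shape[0] * scale)
        new_w = int(image.shape[1] * scale)
        # Crude nearest-neighbour resize so the sketch needs nothing beyond NumPy.
        rows = (np.arange(new_h) / scale).astype(int)
        cols = (np.arange(new_w) / scale).astype(int)
        image = image[rows][:, cols]

def sliding_windows(image, window=64, stride=16):
    """Yield (x, y, patch) for every window position at this scale."""
    h, w = image.shape[:2]
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            yield x, y, image[y:y + window, x:x + window]

def detect(image, classify_window, window=64, stride=16):
    """Run the (hypothetical) classifier on every window at every scale."""
    hits = []
    for level, scaled in enumerate(image_pyramid(image)):
        for x, y, patch in sliding_windows(scaled, window, stride):
            if classify_window(patch):   # the expensive call, repeated many times
                # Coordinates are in the scaled image at this pyramid level.
                hits.append((level, x, y))
    return hits
```

Even with this fairly coarse stride, a VGA-sized image already produces on the order of a thousand windows per pyramid level, and the classifier has to run on every single one, which is exactly why variations in scale and position make detection so expensive.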
Let's see other challenges. We already talked a little bit about occlusions. In real-life scenes, we do not have just one object on some contrasting background. Sometimes we have multiple objects, and some of them could be occluding an object of interest, something we are searching for. Occlusion means that part of the object is not visible, so we're going to have to recognize and detect that object based on partial information. Could we have enough information to still detect, in this case, the dog and the cat? That depends on the algorithm and the strategy being used.

Deformations: some objects are rigid. You can think about the car or the bicycle as rigid objects. They do not appear in deformed shapes; they just have variations in viewpoint and orientation. But some objects are deformable. In this example, you see the exact same two pets, the dog and the cat, in very different positions. Not only that, their shape is different; the elements you were thinking about, what makes up a dog, what defines the model of a dog, may be completely missing in this image. It would be hard to figure out what object this is, even for a human interpreter.

Another challenge: background clutter. We're coming back to two images we have already used. Real-life images, realistic scenes, are cluttered and crowded. There's a lot going on. We have to be able to detect the things that are important to us among whatever is in the background that might not matter for a certain category. Many times we're looking to detect objects from many categories in the same image. For example, in the image here, we could detect the cat, but we could also detect the dog, the books, a little bed. All things happening in the same image: lots of clutter, lots of overlapping objects.

Now, those were challenges in terms of the algorithm: how are we going to identify objects when there's so much variation, so much clutter, so much going on in the image? But there's also a challenge with respect to scope. What are we trying to determine? How many object categories are we talking about? Is it 10? Is it 100? Is it 1,000? Is it 10,000? If you remember, the original ImageNet database had 3.2 million images and 5,000 concepts. We didn't call them categories or objects at that time, but concepts; some things were a little bit more blurry. But today we're actually talking about billions of images. We're talking about half the world's population owning smartphones, phones with cameras, and projections put sales of new smartphones at 1.5 billion in 2021. That means a lot of data, a lot of images. The scope is large, and we're trying to detect many different objects.

Actually, I'm going to add another note here that is very important: we're talking about human-recognizable object categories, something that a human would be able to label, to detect. Sometimes categories can be blurry, like "an object from an office environment" or "an object that you can sit on," because chairs, for example, come in so many shapes and colors, and many things could be perceived as something to sit on without necessarily being a chair. But we're talking about how to do this in an efficient way: not only deciding whether that is a car, yes or no, but is that a car, is that a bicycle, is that a building, is that a person, is that a lamp, is that a rickshaw, all in the same one image. The scope of the challenge is really large.
Another thing that is important: even if, let's say, we manage to annotate images and we manage to say that this image has six people and three red balls, there is a net, and there is a blue board with round holes, can we figure out context? Can we assign meaning to what we see in the image? Can we, for example, say that it is this boy's birthday, that he is in a ninja gym with his family, and that he is watching his older brother while wearing a t-shirt he just got for his birthday? That is going to be really hard. Detecting individual objects, and then inferring meaning by putting them all together, is something we are still working on. Classic computer vision definitely did not get that far, but we will see that deep learning is trying to put all of this together.