In this lecture, we're going to talk about visual features. So far, we have talked a lot about points and lines. These were beautiful abstract concepts, which helped us infer a lot of information about where the camera is, how far objects are, and what the three-dimensional geometry of the scene is. But how can we get these points? We cannot get them by clicking; there must be some automatic way to get them. As an example here, and this is something that will come in one of the next lectures, we have two views where we have correspondences of points. And out of these correspondences of points we can find the poses of the two cameras and then the positions of these points in the scene by triangulation. You see also that a lot of those correspondences between points are wrong, and a lot of the points which exist in one image do not exist in the other. So what kind of points do we really want to detect? What are good features? For example, suppose we don't have a geometry problem, but a problem where we're given an image and a database of many, many images of places, and we want to find the closest image. You can see in the first column of this picture the query: I'm standing in front of this building, and I want to find the closest images in my database. This problem is called place recognition, but it's very similar to many other retrieval tasks, whether they are about objects or places. And you see in this picture that by using a detector and a descriptor of characteristic points, which we're going to describe, we have found the five closest images, and one of those is really close to the place where we are. So what do we really want from these features? We need the detector, the automatic algorithm that tells us at which x-y positions these features are, like this green point here. We see the green points in two images taken from two different viewpoints.
These viewpoints differ in the orientation of the camera, which makes the two images slightly tilted with respect to each other, and there is also a slight scale change: the sizes of things are slightly different. So when a point is detected in the left image, we really want it to be detected in the second image as well. We call this detection repeatability: we want the same point to be detected even if the orientation and the scale change between the two images. We call this property detection invariance. In this case we see a drastic change in scale, and we really see a lot of wrong correspondences. But we also see a lot of good correspondences at many points which have, again, been detected in both images. What else do we want from those features? We want to be able to match them. When we have the set of points, we want to be able to say that these points look really similar. This is about a descriptor that we have to build around those points, in the neighborhood of those points, like the circles we are seeing in this picture. When the images differ by scale and rotation, and there might be other perspective effects because they have been taken from two different viewpoints, we still want the descriptors of the neighborhoods around these points to be as close as possible. We call this property descriptor invariance. So the two main properties we want from features are invariant detection and invariant description. Probably the most challenging of these properties is scale invariance: how can we detect the same points when the scale has drastically changed? There are two main contributions in the literature about this. The first is automatic scale selection by Tony Lindeberg from Sweden, from 1996. And then, subsequently, the legendary paper about the scale-invariant feature transform (SIFT) by David Lowe, which is probably one of the most cited papers in the computer vision literature.
We will explain in this lecture why SIFT is really scale invariant, and what invariance properties we get with this detector and the associated descriptor. Let us first look at a new notion of images, the notion of scale space. If you look carefully at these images, you will realize that, going from left to right and from top to bottom, the images are more and more blurred. So we have taken the original image of this roof, and we have applied some blurring operator repeatedly. How exactly have we obtained those images? Starting from the original image, we have applied what is called a convolution. We are not going to explain in this lecture why this is a convolution, but it is really a sliding mask over the image, where at every point we take the inner product of this mask and the underlying intensities in the image. This convolution is denoted with this star. And in this case, the mask we're using is the two-dimensional Gaussian function. We show the formula of this two-dimensional Gaussian function at the bottom, and this formula contains a sigma, which gives the variance of the function, or how spread out the two-dimensional Gaussian is. If we represent this function as an intensity, we get this blob we see at the top. This is an isotropic two-dimensional Gaussian, because this blob is really circular. So when we take this mask, slide it over the image, and take at every position the inner product, which means we do a convolution, we obtain the image on the right, which is a blurred version of the original image. A series of these blurred versions of the original image, which are convolutions of the original image with Gaussians of different sigmas, is called the scale space, which we can write as the function L(x, y, sigma) of the pixel position and the associated sigma. A more efficient representation of scale space is a pyramid.
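The construction of scale space described above can be sketched in a few lines. This is a minimal illustration, not the lecture's actual code: the function name `scale_space` is mine, and I use scipy's `gaussian_filter` as the Gaussian convolution.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, sigmas):
    """Convolve the image with a Gaussian for each sigma,
    giving the scale-space layers L(x, y, sigma)."""
    return [gaussian_filter(image, sigma) for sigma in sigmas]

# A toy image: a bright square on a dark background.
img = np.zeros((64, 64))
img[24:40, 24:40] = 1.0

# Four layers with increasing blur; detail fades as sigma grows.
layers = scale_space(img, sigmas=[1.0, 2.0, 4.0, 8.0])
```

Each layer has the same resolution as the original; increasing sigma only spreads the intensity out, which is why the contrast (and hence the standard deviation) of the layers decreases with scale.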
So we start blurring, and at the point when the sigma has doubled, we say we have reached an octave, like in music; then we subsample and get the series of images of the next octave, then we subsample again, get another series of images, and so on. This gives us a pyramid of images. And this is an efficient representation of images, because we know that by starting from the coarsest level of the pyramid and going to the finest, we can really reconstruct the original image. Now let's look at just one pixel across all those images in scale space. So we look at the pixel that is inside this yellow ring. We take all its intensity values and plot them along scale, so we have an intensity function which is no longer a function of x and y but, for a fixed (x, y), a function of sigma. The maximum of this function across scale is called the intrinsic scale of the local image structure. And we can obtain this intrinsic scale from this maximum if the function is scale normalized, and we will explain why we need this normalization. Functions that can be scale normalized are the derivatives of the Gaussian function. Let us take as an example the second derivative. The second derivative of a two-dimensional function is actually a matrix, the Hessian, but if we take its trace, we obtain what is called the Laplacian of Gaussian. This is a surface which looks like a Mexican hat, and it has a very nice property: it can be approximated by the difference of two Gaussians with different sigmas. It also looks like a blob, so when we slide this mask while doing a convolution, it really acts like matching blobs. Why do we need the normalization? Let us look at this Laplacian in the one-dimensional case. Let's say that the original signal has a wide blob, like this rectangle on the left, and let's convolve it with this Laplacian at several sigmas.
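The octave structure can be sketched as follows. This is a simplified outline under my own naming (`gaussian_pyramid`) and parameter choices (base sigma 1.6, three layers per octave); real implementations such as VLFeat add more layers and bookkeeping.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, n_octaves=3, steps_per_octave=3, sigma0=1.6):
    """Within each octave, blur with geometrically spaced sigmas from
    sigma0 up to 2*sigma0; once sigma has doubled (one octave, like in
    music), subsample by 2 and start the next octave."""
    k = 2.0 ** (1.0 / steps_per_octave)   # scale ratio between layers
    pyramid = []
    base = image.astype(float)
    for _ in range(n_octaves):
        octave = [gaussian_filter(base, sigma0 * k**i)
                  for i in range(steps_per_octave + 1)]
        pyramid.append(octave)
        # The last layer has sigma 2*sigma0: halve the resolution for
        # the next octave.
        base = octave[-1][::2, ::2]
    return pyramid

img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
pyr = gaussian_pyramid(img, n_octaves=3)
```

The subsampling is what makes the pyramid efficient: each octave has a quarter of the pixels of the previous one, so the whole stack costs only a constant factor more than the finest level.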
If we start blurring, and blurring, and blurring again, in the end the signal will fade out and will not have any amplitude. That's why we need a scale normalization, so that the integral under the curve always remains the same and we do not have this fading out. Then, if we take a specific position x in these signals, we're going to see that at this position we have a maximum. This maximum gives the intrinsic scale, and this is how we obtain the scale of the local neighborhood automatically. Let's look at a specific example. Let's say we have a black blob, like a black circle, and let's say we take the inner product of the circle with the Mexican hat, this Laplacian function which is shown in red. If we put this function exactly at the position where the blob is, we can see that it will match best if the sigma of the function equals the radius divided by the square root of 2. So the intrinsic scale in this image is actually given by the radius of the blob. And we say that if we find the maximum, the sigma that fits best, then we have found the sigma which really corresponds to the actual structure in the image. As we mentioned already, the Laplacian of Gaussian, the one that looks like a Mexican hat, has a very nice property: it can be obtained in two different ways. One is indeed by taking derivatives twice, but another is by taking two original Gaussians, like in this example one with sigma equal to 1.2 and one with sigma equal to 1, and subtracting them; then we get something which is really very close to the Laplacian. When we do this over all the scale space images we have obtained, we get an image like this, which has been heavily amplified here so that it is visible on the slide. But in general it gives you the idea that it is something like a depiction of high-contrast values.
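The claim that the best-fitting sigma equals the blob radius divided by the square root of 2 can be checked numerically. A minimal sketch, assuming a synthetic disk image and scipy's `gaussian_laplace` for the Laplacian of Gaussian; the factor `s**2` is the scale normalization discussed above.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

# A bright disk of radius r on a dark background.
r = 6.0
yy, xx = np.mgrid[:64, :64]
img = ((xx - 32)**2 + (yy - 32)**2 <= r**2).astype(float)

# Scale-normalized Laplacian response at the blob centre, as a
# function of sigma; sigma**2 keeps responses comparable across scales.
sigmas = np.linspace(1.0, 9.0, 81)
responses = [abs(s**2 * gaussian_laplace(img, s)[32, 32]) for s in sigmas]
best = sigmas[int(np.argmax(responses))]
print(best, r / np.sqrt(2))   # best sigma is close to r/sqrt(2) = 4.24
```

Without the `s**2` normalization, the responses would simply fade out with increasing sigma and the argmax would be meaningless, which is exactly the fading-out problem described in the lecture.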
And you can see that in the finer levels, the top levels of the scale space where we start from the original image, we see a lot of detail, while this detail disappears when we go to the coarser levels of the scale space. Now, we have seen in a one-dimensional example how, if we take one pixel and look at its response as a function of sigma, the maximum gives the intrinsic scale. Here we draw this as an (x, y) position over different scale layers, and we will see that there is one sigma, represented here by the size of the circle, which best fits our local structure. In the actual algorithm, it works like this: we have the x-y image and we have several images in scale space. So one point has eight neighbors at the same scale, nine neighbors at the scale which is coarser, and nine neighbors at the scale which is finer. If in this 3x3x3 neighborhood the point has the maximum response of the Laplacian, then we say that this is a SIFT keypoint. We really recommend you to go and download a quite famous package by Andrea Vedaldi, who is a professor at Oxford; it's called VLFeat. Several examples we are running here use this code. So, we have defined the keypoint of SIFT as the maximum along x, y, and sigma in a local neighborhood, and these are the points you see in this image. As we go along sigma and find this maximum, the argmax, the argument of the maximum over sigma, gives the intrinsic scale, which is represented here as a circle. So you see that different points have different sizes of circles: on the left, where the structures become smaller, the circles are smaller, while on the right the circles become larger. Which means that the scale we detected automatically indeed corresponds to the local size of the image structures. This circle defines the support region of the feature.
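The 3x3x3 neighborhood test can be written down directly. A simplified sketch under my own naming (`local_extrema_3d`): it only checks the 26-neighbor extremum condition, whereas real SIFT also refines the location sub-pixel and discards low-contrast and edge responses.

```python
import numpy as np

def local_extrema_3d(dog):
    """dog: stack of difference-of-Gaussian images, shape (n_scales, H, W).
    A keypoint candidate is a pixel whose response beats its 26 neighbors:
    8 at the same scale, 9 at the finer scale, 9 at the coarser scale."""
    keypoints = []
    n, h, w = dog.shape
    for s in range(1, n - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                cube = dog[s-1:s+2, y-1:y+2, x-1:x+2]
                v = dog[s, y, x]
                if v == cube.max() or v == cube.min():
                    # exclude flat regions where everything ties
                    if np.count_nonzero(cube == v) == 1:
                        keypoints.append((s, y, x))
    return keypoints

# A single isolated peak in a tiny stack is found as a keypoint.
dog = np.zeros((3, 5, 5))
dog[1, 2, 2] = 1.0
print(local_extrema_3d(dog))   # [(1, 2, 2)]
```

Both maxima and minima are kept, since a dark blob on a bright background gives a response of the opposite sign to a bright blob.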
The detector is also rotation invariant, because it is a maximum along x, y, and sigma, and this maximum is invariant to in-plane rotations. So we have defined what a keypoint is according to the definition of SIFT, and this detection is invariant to rotation; it also automatically detects the intrinsic scale. What about its neighborhood now? How can we construct a descriptor which will also be invariant to scale and rotation? The descriptor is invariant to scale because we normalize it: we take the circles given by the intrinsic scale radius and warp all of these neighborhoods to a 16 x 16 region; this 16 x 16 neighborhood is then scale invariant. We would have lost the information about scale, but we have kept it as a value together with the original detection. What can we do about the rotation? So we have again this circle, and what we do about the rotation is to compute a histogram of all orientations, and these are the contrast orientations, or gradient orientations. The maximum of this histogram defines the local dominant orientation. This defines a local frame, which we rotate so that the dominant orientation aligns with the x axis. This way the rotation information is gone, and we have obtained a rotation-invariant neighborhood. Now, we have described how the neighborhood is rotation invariant, because we have actually rotated it locally, and scale invariant, because we have normalized everything to the same neighborhood size. But what is the actual description? To build the actual descriptor, what we do is go locally and take a histogram of the gradient orientations. We don't take one histogram over the whole neighborhood. What we do is divide it into small blocks and take a histogram in each of these blocks. This gives us something like this: four blocks, four histograms. As we see in this real picture, each point is associated with this 4 by 4 grid. Each block in this grid is a histogram itself.
It is a histogram over orientations with 8 bins. So we have 8 bins in each block of this grid, and having 16 blocks, we end up with 128 values for all the orientation bins. You can see this represented with the very small green arrows in this picture. You see from the histogram which orientation is dominant locally at every block of this 4 by 4 grid. You see, for example, that when we are at a corner of the checkerboard, we have two dominant orientations which are orthogonal. This 128 by 1 vector does not contain any information about the original scale and the original orientation in the image. This information just comes together with the point (x, y) as a sigma and a theta value. Let us look at some examples of SIFT detections. This is an image of a roof which has a repeated periodic pattern. We see several SIFT features and the associated descriptors. Some of the SIFT features have a large intrinsic scale; they are represented with big circles. Some of them have orientations which are slanted, like 30 degrees; some of them are pointing up, and so on. Now, let us see whether we can match them, whether we indeed find the same points, with the detection invariance we talked about, and whether the descriptors are also the same. And we see these two pictures, which actually have a very small overlap. You really have to think quite carefully about what corresponds to what in these pictures, because they have been taken from viewpoints which are far apart. And you see that although only very small parts of the pictures correspond, we still find hundreds of correspondences. How can we use these correspondences? One application is to create an image mosaic. So we take two pictures of the same scene, with the camera just rotated slightly from one to the other. We find the SIFT features and then we establish correspondences. After we establish correspondences, we compute a projective transformation.
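The descriptor construction above can be sketched as follows. This is a simplified version under my own naming (`sift_like_descriptor`): it builds the 4 by 4 grid of 8-bin orientation histograms on a normalized 16 x 16 patch, but real SIFT additionally applies Gaussian weighting, trilinear interpolation between bins, and clipping before renormalization.

```python
import numpy as np

def sift_like_descriptor(patch):
    """patch: a 16x16 scale- and rotation-normalized neighborhood.
    Split it into a 4x4 grid of 4x4 blocks; in each block, histogram
    the gradient orientations into 8 bins, weighted by the gradient
    magnitude. 16 blocks x 8 bins = a 128-dimensional vector."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)                         # in (-pi, pi]
    bins = ((ori + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    desc = []
    for by in range(4):
        for bx in range(4):
            block = slice(4 * by, 4 * by + 4), slice(4 * bx, 4 * bx + 4)
            hist = np.bincount(bins[block].ravel(),
                               weights=mag[block].ravel(), minlength=8)
            desc.append(hist)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

d = sift_like_descriptor(np.random.rand(16, 16))
print(d.shape)   # (128,)
```

Because the vector is built from histograms of gradient orientations rather than raw intensities, small deformations of the patch move mass between nearby bins instead of changing the descriptor drastically, which is the robustness property mentioned at the end of the lecture.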
With this projective transformation we have to clean up all the outliers. So at the end we end up with a very clean set of correspondences, the inliers, with which we can compute a homography with least squares. And then we can really warp the second image onto the first, and this is the beginning of a mosaic. If you continue doing this over multiple images, you see how a whole mosaic can evolve into a whole panorama. Another application is really finding where you are, if you don't have GPS but you have a database of images. This place recognition is the retrieval part of it. We're going to see that after we narrow down the area where we are, we apply pose estimation in order to find exactly where we are standing when taking this picture. But let's say we are in a new city. We have walked around, we have seen many, many places, and now we have a query image, and we want to find the closest image, the one that is really most similar to the facade we are seeing here. What we can do is detect SIFT points in all these images. This particular image collection is from my home town in Greece. So we have a query image, we have hundreds of images, and we want to find almost the same image. Not from exactly the same viewpoint, not from exactly the same time of day or weather, but still something that has been taken at around the same location. You can see some good matches on the upper right, which are from several viewpoints seeing the same facade. And we can see some of what we call medium matches, which are similar facades. And we can see some bad matches, which are really irrelevant images. What characterizes a good match is really a good set of correspondences, and SIFT is really excellent at providing this set, because it has the best invariance properties. So to summarize: the SIFT detector can automatically select a scale and compute the dominant rotation. The scale and rotation are saved, and then the neighborhood is normalized.
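The least-squares homography step can be sketched with the direct linear transform. A minimal illustration, assuming the outliers have already been removed (e.g. by RANSAC, which the lecture alludes to); the function name `fit_homography` is mine.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: estimate the 3x3 homography H mapping
    src -> dst (inlier correspondences, shape (N, 2), N >= 4) by solving
    the homogeneous least-squares system with the SVD."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)
    _, _, vt = np.linalg.svd(A)
    # The singular vector with the smallest singular value minimizes
    # ||A h|| subject to ||h|| = 1.
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Sanity check with a known transformation.
H_true = np.array([[0.8, -0.2, 5.0], [0.2, 0.8, -3.0], [0.0, 0.0, 1.0]])
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], float)
pts = np.c_[src, np.ones(len(src))] @ H_true.T
dst = pts[:, :2] / pts[:, 2:]
H = fit_homography(src, dst)
print(np.allclose(H, H_true, atol=1e-6))   # True
```

With the estimated H, warping the second image onto the first (the start of the mosaic) is a matter of resampling each pixel through H, which libraries such as OpenCV provide out of the box.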
Because the neighborhood is normalized in size and in orientation, the descriptor will also be invariant with respect to scale and rotation. The fact that we use a histogram, just the distribution of gradient orientations, makes the descriptor also very robust.