Today we start a new module. Join us as we explore convolutional neural networks together.

We've already talked about convolution. The convolution operation involved taking an image, let's call it the original image F, and applying a kernel to it. The kernel was a set of weights, remember that? We're going to call the kernel a filter, because that was its role. The filter slid along the image from left to right and top to bottom, and it produced a new image. The filter values acted as weights, and what we computed at each location was a weighted sum. For the first location, we superimposed the filter on the image, performed the weighted sum, and got the first value in the new image, 27. As the filter slid over to the next location, we performed the weighted sum again and got 22, and so on. As the filter slides along the image, the weights stay the same for each new pixel value calculation. If you remember back when we were studying this in the computer vision module, we came up with filters that, when convolved with the image, gave us information about edges and, later, about texture. Another thing I'd like you to remember right now is that when we were doing convolution, we couldn't really apply the filter everywhere. There were some places in the new image where we could not get a value, because near the borders of the original image there were not enough pixel values under the filter; the center of the filter could never reach those positions. That was convolution.
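To make the sliding weighted sum concrete, here is a minimal NumPy sketch of convolution without padding. The function name and the toy image are ours, just for illustration; the values 27 and 22 above came from the specific image and kernel on the slide, which we don't reproduce here.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and compute a weighted
    sum at each location. (Strictly speaking this is cross-correlation,
    since the kernel is not flipped; deep-learning libraries use the
    same convention and still call it convolution.)"""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Superimpose the filter at (r, c) and take the weighted sum.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy example: a 6x6 image and an edge-detecting 3x3 filter.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
print(convolve2d_valid(image, kernel).shape)  # (4, 4): border rows/cols are lost
```

Notice that the same nine weights are reused at every location; that weight sharing is what will make convolutional layers so economical later on.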
Now, how does convolution factor into convolutional neural networks? Just as we used convolutions in classic computer vision to pick out different patterns, features like edges and texture, they have the exact same role in neural networks. Convolutional neural networks are actually defined by the fact that they have multiple hidden layers, where the hidden layers are made up of convolutions, made up of filters. It's not just one: there are stacked filters, and we talk about a convolutional block. We are going to have multiple kernels, multiple filters, stacked together; they are going to perform the convolution operation over the image multiple times, and then there will be an output from that convolutional layer.

What is the role of these filters? Instead of having one big filter, as big as the image, which is what we were doing at the end of the last module, and which we saw was not really very effective, we are going to attempt to figure out a template for an object by recognizing that each object is made out of small parts. These small parts, put together, are going to help us find a specialized template for an object by combining templates for each part, each component of the object. These filters will help us detect patterns. At the beginning, in the early convolutional layers, we will detect simple patterns, just like the ones we used in classic computer vision: edges, or just colors, or simple shapes. Remember bars and spots from our discussion about texture? Hidden convolutional layers later in the model will put together these simple patterns to detect more global, more invariant, more complex components or features, like eyes or ears or fur, for example.

Each convolutional layer will have a specific output, and that will be passed on to the next layer. You can think of each convolutional layer as a feature extractor. Like I said, early in the model there will be simple features; later in the model, at different stages of the classification pipeline, there will be more complex features.

Now, let's try to figure out how these convolutional layers work. Let's sketch a diagram and try to visualize a convolutional layer and the activation map, which is the result of the convolutional layer. Let's say that we have our image over here. I'm going to draw it a little bit like a block, because it's 32 by 32 by 3; we're talking about an RGB image, like the ones from the CIFAR-10 dataset. Let's say we have a 3 by 3 filter. It's actually going to need to be 3 by 3 by 3, because that's the number of layers, the channels, in the input data. If the input data has three layers, because we have an RGB image, then every convolutional filter is also going to have to have three layers.

Now, let's figure out the size of the output. What is the output going to look like? We're going to get something that we call an activation map. When we were looking at how convolution works on the big image, we said that we really cannot compute things everywhere. The first place where we can apply the filter is one pixel in from the corner, and that's where the first new pixel value lands; all the border positions are problem areas where things are not going to work. For a 3 by 3 filter, we will have no response, we will not be able to compute an output, on the first row, the last row, the first column, and the last column. That means two fewer rows and two fewer columns. We're going to have 32 minus 2, and again 32 minus 2, so only 30 by 30 will be the size of our output.

Now, we have one filter, but what if we have multiple filters? What if we have another 3 by 3? And another one? That's what it means to have a block. What if, like the first convolutional layer of the AlexNet architecture, we have 96 convolutional filters? What's going to happen? That's okay, we can do that; this is the advantage of a convolutional layer. We can apply all 96 filters. What's going to happen is that for each of them we're going to get another activation map, and another activation map, and so on, until we have 30 by 30 by 96. So in this dimension we're going to have 96 as the depth of the output layer. The depth of the output layer, the number of activation maps if you want to think about it that way, is the same as the number of filters in our convolutional block.
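We can sanity-check this shape with TensorFlow, which we'll lean on later in this lecture; this is just a quick sketch. `padding="valid"` is Keras's name for what we've been doing: no padding, so border rows and columns are lost.

```python
import tensorflow as tf

# A batch of one CIFAR-10-sized RGB image: 32 x 32 pixels, 3 channels.
x = tf.random.normal((1, 32, 32, 3))

# A convolutional block of 96 filters. kernel_size=3 means 3 by 3
# (by 3 implicitly, to match the input depth). No padding, so we
# lose two rows and two columns.
y = tf.keras.layers.Conv2D(filters=96, kernel_size=3, padding="valid")(x)
print(y.shape)  # (1, 30, 30, 96): one 30 by 30 activation map per filter
```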
Now, let's see how the size of the output changes if we change the filter. With the 3 by 3 filter, we would miss the first row, the last row, the first column, and the last column, where we wouldn't be able to compute values in the new image. We had 32 minus 2 by 32 minus 2, which became 30 by 30. What happens if we have a 7 by 7 filter instead? I'll draw a big image here with multiple rows and columns. Where is the first place where we can put our filter? Let's count: 1, 2, 3, 4, 5, 6, 7 across, and 1, 2, 3, 4, 5, 6, 7 down. Of course, the image is bigger; I added a few more rows and columns to give you the idea that it extends much further in each direction. The 7 by 7 filter is centered over here, which means this is the first pixel value we'll be able to compute in the new image. That means the first three rows, 1, 2, 3, we won't be able to compute, and at the end, the last three rows, we won't be able to compute either; same thing with the columns: the first three and the last three. So the change is that now we're going to have 32 minus how many? With 3 by 3 we were missing two rows, the first and the last, and two columns. Here we are missing six rows, the first three and the last three, and six columns. That means 26 by 26 will be the size of the output. Take a look at the filter size and try to come up with the rule; there's a very simple formula for how 3 by 3 makes us subtract 2 and 7 by 7 makes us subtract 6. Try to figure out how this quantity gets computed: how many rows and how many columns are we not able to use? It's a pretty simple arithmetic thing, but it's important that we always figure out the size of the output, because the output becomes the input into the next hidden layer.

Let's move on. We've learned about convolutional layers and these filters that we apply multiples of. What parameters are we trying to learn when we have a convolutional neural network versus a regular neural network? Some of the learned parameters are the same: we're going to have to learn weights and bias values. Let's try to figure them out. Let's assume that our inputs are RGB color images. We have this image here, which we said was 32 by 32 by 3; because it's RGB, it's by 3. If it's by 3, that means that one 7 by 7 convolutional filter is actually going to need to be 7 by 7, but also by 3, because we need a weight for every single pixel value. If we have three layers of pixel values, then we're going to need three layers of weights. For every one convolutional filter, we're going to have 7 times 7 times 3, that is 49 times 3, that is 147 parameters; let's just call them weights right now. Plus one bias value, because we're following the same rules as any other neural network: there are weights, and then there's a bias value. We have in total 148 parameters.

Now, let's say that instead of having just one convolutional filter, we have a block of 10 convolutional filters of size 7 by 7. We're going to take these 148 parameters and multiply them by 10, and that one convolutional layer is going to have 1,480 parameters. If we have 10 filters, what is the size of the output? We know that with 7 by 7 we're going to get 26 by 26, because we just computed that. And because there are 10 convolutional filters, the depth of our activation map will be 10. This is really important; we're going to remember this, 26 by 26 by 10, because this is going to be the input into our second hidden layer. Because of this 10 here, our next filters are going to be 5 by 5, but they're going to be by 10. This 10 becomes important in the number of parameters for each convolutional filter: each one is going to have 5 times 5 times 10, that is 250 weights, plus 1 bias value, for 251 parameters.
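Here's the same bookkeeping as a tiny Python helper (a sketch; the function name is ours). It only counts the learned parameters per filter, so you can still work out the output-size formula on your own:

```python
def conv_filter_params(k, input_depth):
    """Learned parameters for a single k x k filter applied to an
    input with `input_depth` layers: one weight per pixel value
    under the filter, plus one bias."""
    return k * k * input_depth + 1

print(conv_filter_params(7, 3))        # 148: 147 weights + 1 bias
print(10 * conv_filter_params(7, 3))   # 1480 for a block of ten such filters
print(conv_filter_params(5, 10))       # 251: 250 weights + 1 bias
```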
But here we only have a block of two filters. In total, we're going to have 2 times 251: this layer will have 502 parameters. We can go on and on and say, okay, this is the size of the next output, which means the next filters are going to have to have this depth, and the next hidden layer is going to have that many parameters, and so on. It's very important to know and understand the number of parameters you're trying to compute. Sure, TensorFlow will summarize everything for you, but you should do these calculations on your own at first so that you don't get surprised by these things.
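For example, here's roughly how TensorFlow summarizes the two-layer model we just counted by hand. This is a sketch: the layer names in the printed summary depend on your session, but the output shapes and parameter counts are exactly the ones we computed.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),   # 32 x 32 RGB input
    tf.keras.layers.Conv2D(10, 7),       # ten 7x7x3 filters
    tf.keras.layers.Conv2D(2, 5),        # two 5x5x10 filters
])
model.summary()
# The summary matches the hand calculation:
#   (26, 26, 10) with 1,480 params = (7*7*3 + 1) * 10
#   (22, 22, 2)  with   502 params = (5*5*10 + 1) * 2
```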
Besides the learned parameters, we have a whole bunch of other parameters, called hyperparameters, that we have to set from the beginning, when we build the model. Some hyperparameters we have to set for all neural networks: how many hidden layers? How many neurons in every layer? Which optimizer are we going to choose? Which activation function? Others are specific to convolutional neural networks, for example, how many filters in a convolutional block. When we choose the number of filters in a convolutional block, for numerical and computational reasons we often want powers of 2: 32, 64, 128, and so on. Again, the number of convolutional filters in one block decides the depth of the output, the size of the activation map, and that carries on into the future layers.

Another important hyperparameter is the window size, which we just saw is the size of the filter. Early CNNs used big filters, 11 by 11 or 7 by 7. More recently, small 3 by 3 filters are preferred to the big ones. It turns out we can achieve the same accuracy; we can figure out templates by composing smaller filters across layers, and it reduces the number of parameters. There are even 1 by 1 filters, which look like they don't do anything. But if you think about it, a 1 by 1 filter is still 1 by 1 by a certain depth, and that depth is decided by the output of the previous layer. If the previous layer has, for example, 128 filters, a 1 by 1 filter in the following layer would in effect compute a weighted combination of all 128 outputs from the previous layer. So 1 by 1 filters are used as well.

Now let's make a comparison and look at the number of parameters that one layer of 7 by 7 filters will require, compared to having three layers, each of them consisting of a stack of 3 by 3 filters. Let's take a look at this for a second. For the one layer of 7 by 7 filters, we have an input here; those are the previous layer's outputs, or images, you can think about it that way. They'll have a height and a width, which depend on the original size of the images and the previous filter sizes, and then some depth D. This is the input layer, and we pass it through our filters. Each will be a 7 by 7 filter, by D, because the depth is important. We'll see in a second how many filters we're going to have, and then we're going to come up with the output size.

Actually, we haven't yet talked about two other hyperparameters, so let me go back for just a second. One is the stride: how much do we move the convolutional filter as we slide it across the image? By default, it would be one. We move it one pixel to the right, or one pixel down, as we traverse the image with our filter. Another one is the padding. If we don't have padding, the size of our output always shrinks: instead of 32 by 32, we end up with 30 by 30 or 26 by 26. But sometimes we want to pad because we want to preserve the size of the input; we want the output to be the same size as the input, and that is achieved with padding. We're going to talk about it very soon. What I wanted to say is that in this example, let's assume that there is padding, so the height and the width are preserved. We're going to see in a second what the depth needs to be.

Now, how do we do the same thing with stacks of 3 by 3 filters, three layers of them? The input needs to be the same, so we can make a clear comparison. Then we have a stack of 3 by 3 filters and an output, then another stack of 3 by 3 filters and another output, and a third stack of 3 by 3 filters and a third output. In order for this to make sense, so we can compare apples to apples, we're going to set the number of filters so that the thickness of each convolutional block is also D, and we end up with depth D at every stage: D filters here, D filters here, and D filters here, so that we end up with the exact same output.

Let's look at how many parameters we have. In the 7 by 7 case, each filter has 7 times 7 times D weights, plus 1 bias value, and we have D filters, so we're going to have 49 D squared plus D parameters. In the case of the 3 by 3 filters, each filter has 3 times 3 times D weights, plus 1 bias value, times D filters, which is 9 D squared plus D; but we have three of these layers, 1, 2, 3, so we have to multiply everything by 3: 27 D squared plus 3 D. If we compare these two quantities, no matter what the value of D is, it's clear that stacking three layers of 3 by 3 filters gives fewer parameters than one layer of 7 by 7 filters: the ratio works out to a reduction of roughly 45 percent, almost 50 percent (there's a quick numerical check at the end of this lesson). That's why, more recently, smaller filters in more layers are preferred to larger filters in fewer layers; the models are becoming deeper, and we have deeper neural networks.

Next time, we'll talk about the stride and about zero padding, and we'll compute how they affect the size of the output as well. We're also going to talk about pooling layers: what they are, and what parameters and hyperparameters they have that we need to learn or need to know before we start our model. Join us next time as we continue our discussion about hyperparameters and also about pooling layers.
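And here is the quick numerical check referenced above, as a short sketch (the function names are ours), using the typical power-of-2 filter counts mentioned earlier:

```python
def params_one_7x7_layer(d):
    # D filters, each 7 x 7 x D, plus one bias per filter: 49*D^2 + D.
    return 49 * d * d + d

def params_three_3x3_layers(d):
    # Three stacked layers, each with D filters of size 3 x 3 x D:
    # 3 * (9*D^2 + D) = 27*D^2 + 3*D.
    return 3 * (9 * d * d + d)

for d in (32, 64, 128):
    big, stacked = params_one_7x7_layer(d), params_three_3x3_layers(d)
    print(d, big, stacked, f"{1 - stacked / big:.1%} fewer")
# For every D, the stacked 3 by 3 layers use roughly 45% fewer parameters.
```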