Why only one layer of perceptrons? Why not send the output of one layer as the input to the next layer? Combining multiple layers of perceptrons sounds like it would be a much more powerful model. However, without non-linear activation functions, all the additional layers can be compressed back down into just a single linear layer, and there is no real benefit; you need non-linear activation functions. Therefore, sigmoid and hyperbolic tangent, or tanh for short, activation functions started to be used for non-linearity. At the time, we were limited to just these, because we needed a differentiable function, since that fact is exploited in back-propagation to update the model weights. Modern-day activation functions are not necessarily differentiable everywhere, and back then people didn't know how to work with them. This constraint, that activation functions had to be differentiable, could make the networks hard to train. The effectiveness of these models was also constrained by the amount of data, the available computational resources, and other difficulties in training. For instance, optimization tended to get stuck in saddle points instead of finding the global minimum we hoped it would find during gradient descent. However, once the trick of using rectified linear units, or ReLUs, was developed, training became much faster, like 8 to 10 times faster, with almost guaranteed convergence for logistic regression. Building off of the perceptron, just like in the brain, we can connect many of them together to form layers, to create feed-forward neural networks. Not much has really changed in the components compared to the single-layer perceptron. There are still inputs, weighted sums, activation functions, and outputs. One difference is that the inputs to neurons not in the input layer are not the raw inputs, but the outputs of the previous layer.
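The claim that stacked linear layers collapse into a single linear layer can be checked numerically. This is a minimal sketch using NumPy; the layer sizes and random weights are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "linear" layers with no activation function between them.
# Hypothetical shapes: 4 inputs -> 3 hidden units -> 2 outputs.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))

x = rng.normal(size=(1, 4))  # a single input example

# Forward pass through both layers...
two_layer_out = (x @ W1) @ W2

# ...is identical to one layer whose weights are the matrix product.
W_combined = W1 @ W2
one_layer_out = x @ W_combined
print(np.allclose(two_layer_out, one_layer_out))  # True
```

However many linear layers you stack, matrix multiplication is associative, so the whole network reduces to one weight matrix; only a non-linearity between layers prevents this collapse.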
Another difference is that the weights connecting the neurons between layers are no longer a vector, but a matrix, because of the completely connected nature of all neurons between layers. For instance, in the diagram, the input-layer weight matrix is four by two, and the hidden-layer weight matrix is two by one. We will learn later that neural networks don't always have complete connectivity, which has some amazing applications and performance benefits, such as with images. Also, there are different activation functions than just the unit step function, such as the sigmoid and hyperbolic tangent, or tanh, activation functions. You can think of each non-input neuron as a collection of three steps packaged up into a single unit. The first component is a weighted sum, the second component is the activation function, and the third component is the output of the activation function. Neural networks can become quite complicated with all the layers, neurons, activation functions, and ways to train them. Throughout this course, we will be using the TensorFlow Playground to get a more intuitive sense of how information flows through a neural network. It's also a lot of fun, and it allows you to customize many more hyperparameters, as well as providing visuals of the weight magnitudes and how the loss function is evolving over time. This is the linear activation function. It is essentially an identity function, because f of x just returns x. This was the original activation function. However, as said before, even in a neural network with thousands of layers all using a linear activation function, the output at the end will just be a linear combination of the input features. This can be reduced to the input features each multiplied by some constant. Does that sound familiar? It is simple linear regression. Therefore, non-linear activation functions are needed to create the complex chains of functions that allow neural networks to learn data distributions so well.
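The three-step picture of a neuron (weighted sum, activation, output), with the four-by-two and two-by-one weight matrices mentioned above, can be sketched as a short forward pass. The specific weight and input values here are hypothetical, chosen only to make the shapes concrete.

```python
import numpy as np

def sigmoid(z):
    # Smooth squashing activation with range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Shapes match the diagram: 4-by-2 input-layer weights,
# 2-by-1 hidden-layer weights (values are made up).
W_input = np.array([[ 0.2, -0.5],
                    [ 0.7,  0.1],
                    [-0.3,  0.4],
                    [ 0.5, -0.2]])
W_hidden = np.array([[ 0.6],
                     [-0.8]])

x = np.array([[1.0, 0.5, -1.0, 2.0]])  # one example, 4 features

# Each non-input neuron: weighted sum -> activation -> output.
hidden = sigmoid(x @ W_input)        # shape (1, 2)
output = sigmoid(hidden @ W_hidden)  # shape (1, 1)
print(output.shape)  # (1, 1)
```

Note how the hidden layer's inputs are not the raw features but the previous layer's activations, exactly as described above.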
Besides the linear activation function, where f of x equals x, the primary activation functions back when neural networks were having their first golden age were the sigmoid and tanh activation functions. The sigmoid activation function is essentially a smooth version of the unit step function: it asymptotes to zero as x goes to negative infinity and asymptotes to one as x goes to positive infinity, with intermediate values everywhere in between. The hyperbolic tangent, or tanh for short, was another commonly used activation function at that point, which is essentially just a scaled and shifted sigmoid with its range now negative one to one. These were great choices, because they are differentiable everywhere, monotonic, and smooth. However, problems such as saturation would occur when either high or low input values to the functions ended up in the asymptotic plateaus of the function. Since the curve is almost flat at these points, the derivatives there are very close to zero. Therefore, training of the weights would go very slowly, or even halt, because the gradients were all very close to zero, resulting in very small step sizes down the hill during gradient descent. Linear activation functions were differentiable, monotonic, and smooth. However, as mentioned before, a linear combination of linear functions can be collapsed back down into one, which doesn't enable us to create the complex chain of functions that we need to describe our data well. There were non-linear approximations to the linear activation function, but they were not differentiable everywhere, so not until much later did people know what to do with them. Very popular now is the rectified linear unit, or ReLU, activation function. It is non-linear, so we can get the complex modeling we need, and it doesn't saturate in the non-negative portion of the input space. However, the negative portion of the input space translates to a zero activation.
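Saturation is easy to see by evaluating the derivatives directly. A small sketch, using the standard identities that the sigmoid's derivative is s(z)(1 - s(z)) and tanh's derivative is 1 - tanh(z) squared:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Near zero the gradient is healthy; deep in the asymptotic
# plateaus it is vanishingly small, so gradient descent barely moves.
print(sigmoid_grad(0.0))   # 0.25, the maximum
print(sigmoid_grad(10.0))  # ~4.5e-05, nearly zero: saturated

# tanh saturates the same way: d/dz tanh(z) = 1 - tanh(z)**2
print(1.0 - np.tanh(10.0) ** 2)  # ~8.2e-09
```

With gradients this small, the weight updates during back-propagation are tiny, which is exactly why training slows or halts when activations end up in the plateaus.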
As a result, ReLU layers can end up dying, or no longer activating, which can also cause training to slow or stop. There are some ways to solve this problem, one of which is using another activation function called the exponential linear unit, or ELU. It is approximately linear in the non-negative portion of the input space, and it is smooth, monotonic, and, most importantly, non-zero in the negative portion of the input space. The main drawback of ELUs is that they are more computationally expensive than ReLUs, due to having to calculate the exponential. We will get to experiment more with these in the next module. If I wanted my outputs to be in the form of probabilities, which activation function should I choose in the final layer? The correct answer is the sigmoid activation function. This is because the range of the sigmoid function is between zero and one, which is also the range of a probability. Beyond just the range, the sigmoid function is the cumulative distribution function of the logistic probability distribution, whose quantile function is the inverse of the logit, which models the log odds. This is why a sigmoid output can be used as a true probability. We will talk more about those reasons later in this specialization. Tanh is incorrect because, even though it is a squashing function like the sigmoid, its range is between negative one and one, which is not the range of a probability. Furthermore, just squashing a tanh into a sigmoid's range will not magically turn it into a probability, because it doesn't have the same properties mentioned above that allow a sigmoid output to be interpreted as a probability. To correctly convert a tanh into a sigmoid, you'd have to add one and divide by two to get the correct range, and, to get the right spread, you'd have to divide tanh's argument by two. But by then you've already calculated the tanh, so you're simply repeating a bunch of work, and you may as well have just used a sigmoid to start.
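The conversion described above, divide the argument by two, then add one and divide by two, is the standard identity sigmoid(x) = (tanh(x / 2) + 1) / 2, and it can be verified numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shift and scale tanh into a sigmoid:
#   sigmoid(x) == (tanh(x / 2) + 1) / 2
# Divide the argument by 2 for the right spread, then
# add 1 and divide by 2 for the (0, 1) range.
x = np.linspace(-5.0, 5.0, 11)
converted = (np.tanh(x / 2.0) + 1.0) / 2.0
print(np.allclose(sigmoid(x), converted))  # True
```

This confirms the point in the lecture: the conversion works, but it costs a tanh evaluation plus extra arithmetic, so you may as well compute the sigmoid directly.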
ReLU is incorrect because its range is between zero and infinity, which is far from the range of a probability. ELU is also incorrect, because its range extends from negative alpha (a small negative lower bound) up to infinity, which again is not the range of a probability.
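The range argument can be made concrete by evaluating both functions on a few inputs. A minimal sketch, using the standard definitions ReLU(z) = max(0, z) and ELU(z) = z for z > 0, alpha * (exp(z) - 1) otherwise:

```python
import numpy as np

def relu(z):
    # max(0, z): range [0, infinity)
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # z for z > 0, alpha * (exp(z) - 1) otherwise:
    # range (-alpha, infinity)
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(relu(z))  # [0, 0, 0, 1, 100] -- unbounded above
print(elu(z))   # bounded below by -alpha, unbounded above
# Neither output stays within [0, 1], so neither can be
# read directly as a probability.
```

Only the sigmoid squashes any real input into (0, 1), which is why it is the right choice for a probability output.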