[MUSIC] In this video we'll continue our discussion of neural networks and talk about multi-layer feed-forward ANNs. ANNs are used for learning complex systems, and complex multi-layer networks are needed to solve the more difficult problems. So we need to include hidden layers of nodes, which allow combinations of linear functions. But the question now becomes: how do we update the hidden layer? The problem shown on the right consists of two classes, A and B. As we can see, the decision boundary is very haphazard and would be difficult to model. For this we need a neural network with several hidden layers and several neurons. So, coming back to the question of how we can update the weights of the hidden-layer neurons: the solution for multi-layer ANN weight updates is the back-propagation algorithm. In it, the global error is propagated backward to the network nodes, and weights are modified in proportion to their contribution. The error is calculated from the difference between the target and the actual output, E = Σ_j (t_j − o_j)². The rate of change of the error is fed back through the network using Δw_ij = −η ∂E/∂w_ij. This is gradient descent in weight space toward a minimum of the error, using a fixed learning rate η. In the back-propagation algorithm, the first step is to initialize the weights w_ij, where w_ij is the weight from node i to node j. Then a pattern is presented to the neural network, along with a target output. The output o_j is computed as 1 / (1 + e^(−x_j)), assuming we are using a sigmoid activation function, where x_j = Σ_i w_ij · o_i is the total weighted input for node j. In the next step we update the weights: w_ij at step t+1 is w_ij at step t plus Δw_ij, where Δw_ij = −η ∂E/∂w_ij.
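The update rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's exact network: the tiny XOR-style dataset, the four hidden units, and the learning rate of 0.5 are assumptions chosen so the loop runs quickly; the formulas (sigmoid output, squared error, Δw_ij = −η ∂E/∂w_ij) are the ones from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: initialise weights w_ij (input->hidden and hidden->output);
# the layer sizes here are illustrative assumptions
W1 = rng.uniform(-0.5, 0.5, (2, 4))
W2 = rng.uniform(-0.5, 0.5, (4, 1))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)  # input patterns
T = np.array([[0], [1], [1], [0]], float)              # target outputs

eta = 0.5                  # fixed learning rate
errors = []
for epoch in range(5000):
    # Steps 2-3: present patterns, compute o_j = 1 / (1 + e^(-x_j))
    H = sigmoid(X @ W1)    # hidden-layer outputs
    O = sigmoid(H @ W2)    # network outputs
    errors.append(float(np.sum((T - O) ** 2)))  # E = sum_j (t_j - o_j)^2
    # Back-propagate the error: each node's weight change is
    # proportional to its contribution to E
    dO = (O - T) * O * (1 - O)        # output-layer delta
    dH = (dO @ W2.T) * H * (1 - H)    # hidden-layer delta
    # Step 4: w_ij(t+1) = w_ij(t) + delta_w, delta_w = -eta * dE/dw_ij
    W2 -= eta * H.T @ dO
    W1 -= eta * X.T @ dH
```

Running the loop, the recorded error shrinks over the epochs, which is exactly the gradient descent in weight space the slide describes.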
Steps 2 through 4 are repeated until an acceptable level of error is achieved. Now let's look at an example of how we can train neural networks. Go to the website playground.tensorflow.org. It offers four 2-dimensional datasets: a circle-and-ring dataset, the XOR dataset, two Gaussian clusters, and a spiral dataset. All the datasets consist of two classes, shown in blue and orange. There are seven features available which can be provided as input to the neural network: X, Y, X², Y², XY, sin(X), and sin(Y). We can vary the number of hidden layers and the number of neurons in each layer. So let's look at the circle data. If we use X and Y as features and vary the learning rate, what is the effect? We can check the effect of a slow versus a fast learning rate: for slow we'll take a value of 0.003, and for fast, say, 0.1. For 0.1 we get quick convergence at around 40 epochs, whereas for 0.003 convergence is very slow and takes around 1,000 epochs. Now, what happens when the learning rate is too small, say 0.00001, or too large, say 10? For a too-small learning rate, it takes forever to reach a solution. If the learning rate is too high, there is no convergence: the system keeps jumping from one state to another. Again for the circle dataset, what happens if we include X² and Y² as features as well? In this case we get very quick convergence even with slow learning. Why? Because these features make the data easily separable: the points inside the circle have a low value of X² + Y², whereas the ones on the outer ring have a high value. So the problem becomes linearly separable, and hence can be easily solved. Now let's look at the XOR, or exclusive-or, dataset.
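The X² + Y² argument can be checked directly. This sketch generates points mimicking the playground's circle dataset (the radii of the inner disc and outer ring are assumptions for illustration) and shows that a single threshold on the combined feature separates the two classes, which is what "linearly separable" means here.

```python
import math
import random

random.seed(0)

def ring_point(r_lo, r_hi):
    """Sample a point at a random angle, with radius in [r_lo, r_hi]."""
    r = random.uniform(r_lo, r_hi)
    a = random.uniform(0, 2 * math.pi)
    return (r * math.cos(a), r * math.sin(a))

# Assumed radii: inner (blue) class within radius 1, outer (orange)
# ring near radius 2, mirroring the playground's circle data
inner = [ring_point(0.0, 1.0) for _ in range(100)]
outer = [ring_point(1.8, 2.2) for _ in range(100)]

# In the feature x^2 + y^2 the classes sit on opposite sides of one
# threshold, so a single linear unit can separate them
f = lambda p: p[0] ** 2 + p[1] ** 2
print(max(f(p) for p in inner) < 2.0 < min(f(p) for p in outer))
```

With the raw X and Y features no single line separates a disc from the ring around it; the squared features fold the radius into a coordinate the network can threshold.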
So, in this, as we can see, the first class, shown in blue, is in the 1st and 3rd quadrants, and the second class is in the 2nd and 4th quadrants. So which features do we expect to perform best on this data? As we see, features such as X, Y, X², and Y² do not perform well, but as soon as we include the XY feature, the algorithm converges quickly. Why? Because for the blue class the product of X and Y is positive, whereas for the orange class it is always negative, and hence the problem becomes linearly separable. Now let's look at the slightly more complex spiral data. For this data, no matter what parameters we set, we do not get good results, because it is highly complex data with a haphazard decision boundary. So how can we do better? We can increase the number of hidden layers and the number of neurons in each layer, so that the network can learn even a very complex decision boundary. Now we'll discuss a few network design and training issues. The first one is design, in which we need to consider the architecture of the network, the structure of the artificial neuron, and the learning rules. In training, we need to ensure optimal training, that is, model convergence; the learning parameters are also very important, and data preparation matters: we need to choose appropriate features so that the ANN can learn them. Next, we need to determine the number of network weights: how many layers to use and how many nodes per layer. That is, we need to choose the input layer, the hidden layers (how many there should be), and the output layer. There are several methods for this. Some are automated methods based on augmentation, such as cascade correlation, which begins with a minimal network and automatically trains and adds new hidden units one by one until we get acceptable performance. The opposite is the weight pruning and elimination method, which assumes that a sufficiently large ANN
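The sign argument for the XY feature can be verified with a small sketch. The sampling ranges below are illustrative assumptions; the point is only that every quadrant-1 or quadrant-3 point has a positive product and every quadrant-2 or quadrant-4 point a negative one, so a threshold at zero on XY separates the classes.

```python
import random

random.seed(1)

def sample(sign_x, sign_y):
    """Sample a point in the quadrant selected by the two signs."""
    return (sign_x * random.uniform(0.1, 5.0),
            sign_y * random.uniform(0.1, 5.0))

# Blue class: quadrants 1 and 3; orange class: quadrants 2 and 4
blue = [sample(1, 1) for _ in range(50)] + [sample(-1, -1) for _ in range(50)]
orange = [sample(1, -1) for _ in range(50)] + [sample(-1, 1) for _ in range(50)]

# A single threshold at 0 on the feature x*y separates the classes
print(all(x * y > 0 for x, y in blue) and
      all(x * y < 0 for x, y in orange))
```

This is the same trick as X² + Y² on the circle data: a hand-crafted feature turns a problem that needs a curved boundary into one a single linear unit can solve.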
is already trained; it then automatically reduces the network size, increasing its generalization ability and overcoming overfitting. In terms of connectivity, we can have a fully connected network, in which each node in a given layer is connected to all the nodes in the next layer. We can also add constraints to the connections, such as selective connectivity or shared weights, in which many edges are given the same weight. We can also have recurrent connections. The choice of input integration method for each neuron can also differ: we can simply sum the inputs, or square the inputs and then sum them, or multiply them. Once this combination is calculated, we need to choose the activation or transfer function. We can have a 0/1 step function or a −1/+1 step function, a ReLU activation, or a sigmoid or tanh activation. To select the learning rule, there are several options, such as the generalized delta rule, which is the steepest-descent algorithm, momentum descent, or advanced weight-space search techniques. The global error function can also vary, for example quadratic or cubic. To ensure a well-trained network, our aim is to achieve good generalization accuracy on new examples. So, when the amount of available data is large, we can use separate validation data: we divide our data into training, validation, and test sets. However, if the amount of data is not large, we need to use cross-validation to tune the ANN parameters. A parameter such as the learning rate can range from 0.01 to 0.99, but typically it is advised to use a learning rate close to 0.1. Momentum can have any value from 0.1 to 0.9, but the advised value is 0.8. For network weight initialization, we can choose random initial weights within a range, with smaller weights for nodes with many incoming connections.
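The initialization advice at the end, smaller random weights for nodes with many incoming connections, is often implemented by scaling the range by the fan-in. The 1/√fan_in scale below is a common convention assumed here for illustration; the lecture only states the qualitative rule.

```python
import math
import random

random.seed(0)

def init_weights(fan_in, fan_out):
    """Random weights in [-limit, +limit], narrower for larger fan-in.
    The 1/sqrt(fan_in) scale is an assumed, conventional choice."""
    limit = 1.0 / math.sqrt(fan_in)
    return [[random.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

small_layer = init_weights(4, 8)    # few inputs  -> range +/-0.5
large_layer = init_weights(100, 8)  # many inputs -> range +/-0.1
print(max(abs(w) for row in large_layer for w in row)
      < max(abs(w) for row in small_layer for w in row))
```

Keeping the total weighted input x_j = Σ w_ij o_i at a moderate scale stops the sigmoid units from starting out saturated, which would make the gradients, and hence learning, vanishingly small.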
The typical convergence behaviour looks like the left figure, which is the ideal case: the error decreases rapidly and then approaches zero. The middle figure shows a saturated total error, in which the error becomes constant after some time; this is not a local minimum, and we can reduce the learning parameter to get better results. The figure on the right shows the error fluctuating even after many iterations, which means the data may not be learnable. The quality of the result relates directly to the quality of the data: if you feed garbage to the model, it will give garbage out. So there is a three-step process: first, consolidation and cleaning of the data; then feature selection and pre-processing; and then transformation and encoding. Back-propagation ANNs of this type accept only continuous numeric values in the range 0.1 to 0.9. So to apply this type of ANN, if the data is in any other form, we need to pre-process it to convert it into this form. In this video, we discussed several issues that we face while training an ANN model and how we can solve them. Thank you. [MUSIC]
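The pre-processing step mentioned above, converting arbitrary numeric data into the 0.1-0.9 range, is usually a min-max rescaling. Here is a minimal sketch; the sample feature values are made up for illustration.

```python
def scale_to_range(values, lo=0.1, hi=0.9):
    """Min-max rescale so the smallest value maps to lo and the
    largest to hi, as this type of back-propagation ANN expects."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

ages = [18, 25, 40, 63]          # illustrative raw feature values
scaled = scale_to_range(ages)    # smallest -> 0.1, largest -> 0.9
print(scaled)
```

Each input feature is rescaled independently before training; the same vmin and vmax from the training data would then be reused to scale any new examples.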