Let's do a quick review of the training loop. Assuming that we've already split our dataset into training, validation, and test sets, we do the following. First, we take a pass through our training dataset, feeding each sample into our model. For each sample, the model makes a prediction based on the sample's features. We then compute the loss between the model's prediction and the sample's label. The loss is a numerical value representing how far the prediction is from the label: low loss is good, high loss is bad. The model then updates its parameters in a way that will reduce the loss it produces the next time it sees that same sample. This is known as an optimization step. Periodically, for example after we've taken a full pass through the training dataset, we evaluate the model on the validation set. In this phase we assess whether the parameters the model has learned produce accurate predictions on data it has not yet observed, in other words the validation set. The model does not learn from these samples because we do not execute the optimization step during this phase; without the optimization step, the model cannot update its parameters, which in turn prevents learning. We use the validation set as a measure of how the model will do in the real world, and we save a version of the model whenever it gives us the best validation performance we've seen so far. Steps two and three comprise the training loop. We repeat these steps until the model has converged, meaning that continued optimization is no longer reducing the loss much on the training dataset. Once we've converged, we pick out the best model, that is, the model that produces the best predictions on the validation set. We then repeat the whole process multiple times, each time with a different training configuration. This is known as hyperparameter tuning. 
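The loop described above can be sketched in code. This is a minimal, illustrative sketch, not any particular framework's API: the model here is a made-up one-parameter function y = w * x, and the learning rate and epoch count are arbitrary choices for the example.

```python
# Minimal sketch of the training loop: train, validate, keep the best model.
# The model y = w * x and its gradient-based update are illustrative only.

def predict(w, x):
    return w * x

def loss(pred, label):
    # Squared error between the prediction and the label.
    return (pred - label) ** 2

def optimization_step(w, x, label, lr=0.01):
    # Gradient of (w*x - label)^2 with respect to w is 2*(w*x - label)*x;
    # moving w against the gradient reduces the loss on this sample.
    grad = 2 * (predict(w, x) - label) * x
    return w - lr * grad

def validate(w, val_set):
    # Average loss on held-out data; note there is NO optimization step here.
    return sum(loss(predict(w, x), y) for x, y in val_set) / len(val_set)

# Toy data: the true relationship is y = 3x, so the model should learn w ≈ 3.
train_set = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
val_set = [(x, 3.0 * x) for x in [0.75, 1.25]]

w, best_w, best_val = 0.0, 0.0, float("inf")
for epoch in range(50):                  # repeat until (roughly) converged
    for x, y in train_set:               # one pass through the training data
        w = optimization_step(w, x, y)
    val_loss = validate(w, val_set)      # periodic validation
    if val_loss < best_val:              # save the best model seen so far
        best_val, best_w = val_loss, w
```

Running this with different learning rates or epoch counts is exactly the hyperparameter tuning described above: each configuration produces a slightly different best model.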
Different training configurations, or hyperparameters, often produce models of different performance. Once we're happy with our model's performance on the validation set, we evaluate it one more time on the test set. This is the number that is reported in publications or by commercial algorithms. Now, in order to better understand how neural networks operate relative to other machine learning algorithms, we need to dive into one particular aspect of the training loop: the optimization step. The optimization step is the point at which the parameters of the network are updated. There are three components of the optimization step that we will cover: loss, gradient descent, and backpropagation. We've been leading up to the concept of how exactly a model learns through trial and error, so how does the model know if it's getting things right or wrong? The answer is something called loss, which we've touched on before. Loss is a key concept because it informs the way in which all of the different supervised machine learning algorithms determine how close their estimated labels are to the true labels. Let's consider a simple example using a one-dimensional dataset, so there is one feature, and a prediction function that is a line. In the example shown on screen, the circles represent the true labels for a given x, while the line represents some prediction function. In general terms, the example on the left has the higher loss. How can we tell? The line is farther away from the circles overall than in the example on the right. Again, the line is the function and the circles are the examples. The goal of training is to find a function, or model, containing the set of weights and biases that results in the lowest loss across all of the examples in the dataset. As we alluded to earlier, the loss is the difference between the model's guess based on the data and the actual correct label. 
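The two-lines comparison can be made concrete with a few made-up numbers (these are not the values from the on-screen figure): we measure each candidate line by how far its predictions sit from the circles.

```python
# Illustrative points (x, true label); the true relationship here is y = x + 1.
points = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]

def line_far(x):
    # A candidate line that sits far from the circles.
    return 3.0 * x

def line_close(x):
    # A candidate line that tracks the circles exactly.
    return x + 1.0

def total_loss(line, points):
    # Sum of squared vertical distances between the line and the circles.
    return sum((line(x) - y) ** 2 for x, y in points)

loss_far = total_loss(line_far, points)      # large: line is far from circles
loss_close = total_loss(line_close, points)  # zero: line passes through them
```

The line with the smaller total distance to the circles has the lower loss, which is the criterion training uses to prefer one set of weights and biases over another.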
To learn during training, the model calculates the loss, or how badly it missed the true label, and then adjusts its parameters in order to minimize that loss. Again, the idea is to minimize the loss. A high value for the loss means that our model performed very poorly; a low value means our model performed very well. At this point you might be thinking to yourself: what if I could create a mathematical function that processes all of the individual losses to decide how well a model performs? If you did, you'd probably call it a loss function, and you'd be right. In a diverse field like machine learning, you can bet that there are many different types of loss functions out there, and choosing among them requires an understanding of the data you're using as well as the task you're asking the model to solve. There are commonly used loss functions that you should be familiar with and understand why they are important, but remember that there are several types and the choice depends heavily on the data and the task. Each loss function has unique properties and helps your algorithm learn in a specific way, shaping the function or model to fit the data in the way that you want. Some may put more weight on outlier labels, others on the majority of labels, and so on. Here we're just going to cover a few of the most common loss functions so that you have a better grasp of the concept. We'll start with something called mean squared error. Mean squared error is the simplest and most common loss function, about as fundamental as regression itself in any machine learning course. To calculate the mean squared error, you take the difference between the model's prediction and the true label, which is also known as the ground truth, square it, and then average it out across the whole dataset. That's pretty much it. 
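That recipe, difference, square, average, translates directly into code. The predictions and labels below are made-up numbers just to exercise the function.

```python
def mean_squared_error(predictions, labels):
    # Difference between prediction and ground truth, squared,
    # then averaged across the whole dataset.
    assert len(predictions) == len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Illustrative values: errors are -0.5, 0.5, and 0.0.
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```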
By the way, the reason that we square is that we don't care whether the error, the difference between the prediction and the ground truth, is positive or negative; we only care about the magnitude of the error and want to minimize it. Squaring gets rid of the positive-versus-negative sign of the error. The squaring has another benefit as well: mean squared error is great for ensuring that our trained model has no outlier predictions with huge error, since MSE puts a larger weight on these errors, producing a disproportionately larger loss due to the squaring part of the function. There is another, similar type of loss function called the mean absolute error. As the name implies, it is not very different from the mean squared error, but in some sense it provides the opposite properties, which can be confusing, so please pay attention to the terms here. MAE is different because we apply the absolute value to the errors instead of squaring them. MAE still removes the negative signs, meaning that a negative two is treated the same as a positive two, but the key difference from MSE is that, since we did not square the difference, the errors contribute on a linear scale in MAE rather than a quadratic one. So when deciding whether to use MAE or MSE, there can be pros and cons based on the problem at hand, but much of it boils down to which error characteristics are better for the use case. If reducing an already small error closer to zero has the same value as pushing a larger error down by the same amount, then MAE might be a good choice. On the other hand, if small but non-zero errors are in some sense already good enough, and it would be acceptable to keep them in exchange for a greater reduction in the larger errors from outliers, then MSE is a better choice. That covers mean squared error and mean absolute error. We will cover more loss functions in the next video.
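The outlier-weighting difference is easy to see on made-up numbers: below, three errors of 1 plus one outlier error of 10 contribute 10 to the MAE sum but 100 to the MSE sum, so MSE penalizes the outlier far more heavily.

```python
def mean_absolute_error(predictions, labels):
    # Absolute value of each error, averaged: errors count on a linear scale.
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)

def mean_squared_error(predictions, labels):
    # Squared errors, averaged: large errors count disproportionately more.
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Illustrative values: three small errors of 1 and one outlier error of 10.
labels      = [0.0, 0.0, 0.0, 0.0]
predictions = [1.0, 1.0, 1.0, 10.0]

mae = mean_absolute_error(predictions, labels)  # (1 + 1 + 1 + 10)  / 4 = 3.25
mse = mean_squared_error(predictions, labels)   # (1 + 1 + 1 + 100) / 4 = 25.75
```

Minimizing MSE on this data would push the model to shrink the outlier error first, while minimizing MAE treats a unit of improvement on any error equally, which is exactly the trade-off described above.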