Now that you've seen a number of different evaluation metrics for both binary and multi-class classification, let's look at how you can apply them as criteria for selecting the best classifier for your application, otherwise known as model selection.

In previous lectures we've seen several evaluation frameworks that could be used for model selection. First, we simply trained and tested on the same dataset, which, as we well know, typically overfits badly and doesn't generalize well to new data. As a side note, though, it can serve as a useful sanity check that your software engineering and feature generation pipeline is working correctly. Second, we frequently used a single train/test split to produce a single evaluation metric. While fast and easy, this doesn't give a very realistic estimate of how well the model will work on future new data, and it gives no picture of the variance in the evaluation metric across different test sets. Third, we used k-fold cross-validation to create k random train/test splits, with the evaluation metric averaged across the splits. This gives more reliable estimates of how the model will perform on unseen data. In particular, we can also use grid search, via GridSearchCV, which runs cross-validation for each candidate parameter setting to find the parameters that optimize the model with respect to the evaluation metric.

The default evaluation metric used by cross_val_score and GridSearchCV is accuracy. So how do you apply the new metrics you've learned about here, like AUC, in model selection? Scikit-learn makes this very easy: you simply add a scoring parameter set to the string name of the evaluation metric you want to use. Let's first look at an example using the scoring parameter for cross-validation, and then we'll look at the other primary method of model selection, grid search.

In the notebook, we have a cross-validation example that runs five folds using a support vector classifier with a linear kernel and the C parameter set to 1. The first call to cross_val_score just uses the default, accuracy, as the evaluation metric. The second call sets the scoring parameter to the string 'roc_auc', which uses AUC as the evaluation metric. The third call sets the scoring parameter to 'recall' to use that as the evaluation metric. You can see the resulting list of five evaluation values, one per fold, for each metric. Note that we're not doing any parameter tuning here; we're simply evaluating the model's average performance across multiple folds.

In the grid search example, we use a support vector classifier with a radial basis function (RBF) kernel. The critical parameter here is gamma, which intuitively sets the radius, or width of influence, of the kernel. We use GridSearchCV to find the value of gamma that optimizes a given evaluation metric in two cases: in the first, we optimize for average accuracy; in the second, we optimize for AUC. In this particular case, the optimal value of gamma happens to be the same, 0.001, for both evaluation metrics. But as we'll see later, in other cases the optimal parameter value can be quite different depending on which evaluation metric is used to optimize.

You can see the complete list of names for the evaluation metrics supported by the scoring parameter by running the following code, which uses the SCORERS variable imported from sklearn.metrics.
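To make this concrete, here's a minimal sketch of the kind of code being described, assuming the binary digits task (the digit 1 versus all other digits) used throughout the notebook. The particular gamma grid is an illustrative choice, and newer scikit-learn versions list the scorer names via get_scorer_names rather than the SCORERS variable.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV

# Binary version of the digits task: is the digit a 1 or not?
dataset = load_digits()
X, y = dataset.data, (dataset.target == 1).astype(int)

# Five-fold cross-validation of a linear-kernel SVC with three different metrics
clf = SVC(kernel='linear', C=1)
print('Accuracy:', cross_val_score(clf, X, y, cv=5))                     # default scoring
print('AUC:     ', cross_val_score(clf, X, y, cv=5, scoring='roc_auc'))
print('Recall:  ', cross_val_score(clf, X, y, cv=5, scoring='recall'))

# Grid search over gamma for an RBF-kernel SVC, optimizing accuracy vs. AUC
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10]}   # illustrative grid

grid_acc = GridSearchCV(SVC(kernel='rbf'), param_grid=grid_values)       # default: accuracy
grid_acc.fit(X, y)
print('Best gamma (accuracy):', grid_acc.best_params_)

grid_auc = GridSearchCV(SVC(kernel='rbf'), param_grid=grid_values, scoring='roc_auc')
grid_auc.fit(X, y)
print('Best gamma (AUC):', grid_auc.best_params_)

# Complete list of scorer names accepted by the scoring parameter
# (older scikit-learn versions expose this as the SCORERS dictionary in sklearn.metrics)
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))
```

Note that the 'roc_auc' scorer here uses the SVC's decision_function scores, so no predicted probabilities are needed.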
In that list you can see metrics for classification, such as the string precision_micro, which represents micro-averaged precision, as well as metrics for regression, such as r2 for the R-squared regression score.

Let's take a look at a specific example that shows how a classifier's decision boundary changes when it's optimized for different evaluation metrics. This classification problem is based on the same binary digit classifier training and test sets we've been using as an example throughout the notebook. In these classification visualizations, the positive examples, the digit 1, are shown as black points, and the region of positive class prediction is shown as the light-colored, or yellow, region to the right of the decision boundary. The negative examples, all other digits, are shown as white points, and the region of negative class prediction in these figures is to the left of the decision boundary. The data points have been plotted using two of the 64 feature values in the digits dataset and have been jittered a little; that is, I've added a little random noise so we can more easily see the density of examples in the feature space.

Here's the scikit-learn code that produced this figure. We apply grid search here to explore different values of the optional class_weight parameter, which controls how much weight is given to each of the two classes during training. As it turns out, optimizing for different evaluation metrics results in different optimal values of the class_weight parameter. As the class weight on the positive class increases, more emphasis is given to correctly classifying positive class instances.

The precision-oriented classifier we see here, with a class weight of 2, tries hard to reduce false positives while increasing true positives, so it focuses on the cluster of positive class points in the lower right corner, where there are relatively few negative class points. Here, precision is over 50%. In contrast, the recall-oriented classifier, with a class weight of 50, tries hard to reduce the number of false negatives while increasing true positives; that is, it tries to capture most of the positive class points in its positive class predictions. We can also see that the f1-oriented classifier has an optimal class weight of 2, which is between the optimal class-weight values for the precision- and recall-oriented classifiers. Visually, the f1-oriented classifier's decision boundary also sits in a kind of intermediate position between the precision- and recall-oriented decision boundaries, which makes sense given that f1 is the harmonic mean of precision and recall. The AUC-oriented classifier, with an optimal class weight of 5, has a decision boundary similar to the f1-oriented classifier's, but shifted slightly in favor of higher recall.

We can see the precision-recall tradeoff very clearly for this classification scenario in the precision-recall curve for the default linear-kernel support vector classifier optimized for accuracy on the same dataset, using the balanced option for the class_weight parameter. Let's take a look at the code that generated this plot.
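The full code is in the accompanying notebook; below is a minimal sketch of the same two steps (the class_weight grid search and the precision-recall curve), assuming a linear-kernel SVC. The two feature indices and the class-weight grid are illustrative assumptions, not the notebook's exact values.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

# Two of the 64 pixel features, binary target (digit 1 vs. all other digits).
# The feature indices [20, 59] are illustrative, not the notebook's exact choice.
dataset = load_digits()
X = dataset.data[:, [20, 59]]
y = (dataset.target == 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search over the weight given to the positive class, once per metric
grid_values = {'class_weight': [{1: w} for w in [1, 2, 3, 4, 5, 10, 20, 50]]}
for metric in ['precision', 'recall', 'f1', 'roc_auc']:
    grid = GridSearchCV(SVC(kernel='linear'), param_grid=grid_values, scoring=metric)
    grid.fit(X_train, y_train)
    print(metric, '-> best class_weight:', grid.best_params_)

# Precision-recall curve for a linear-kernel SVC with class_weight='balanced',
# using decision-function scores on the test set
clf = SVC(kernel='linear', class_weight='balanced').fit(X_train, y_train)
precision, recall, _ = precision_recall_curve(y_test, clf.decision_function(X_test))
plt.plot(precision, recall)
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.show()
```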
Take a moment to imagine how the extreme lower right part of this precision-recall curve corresponds to a decision boundary that's highly precision-oriented, sitting in the lower right of the classification plot where there's a cluster of positive examples. As the decision threshold is shifted to become less and less conservative, tracing the curve up and to the left, the classifier becomes more and more like the recall-oriented support vector classifier example. Again, the red circle represents the precision-recall tradeoff achieved at the zero-score mark, which is the actual decision boundary chosen for the trained classifier.

For simplicity, we've often used a single train/test split when showing examples of evaluation scoring. However, using only cross-validation or a test set for model selection or parameter tuning can still lead to more subtle forms of overfitting, and thus to overly optimistic evaluation estimates for future unseen data. An intuitive explanation might be the following. Remember that the whole point of evaluating on a test set is to estimate how well a learning algorithm is likely to perform on future unseen data. The more information about our dataset we use across repeated cross-validation passes while choosing a model, the more any held-out test data ends up influencing the selection of the final model rather than merely evaluating it. This is sometimes called data leakage, and we'll describe that phenomenon in more detail in another module.

So we haven't done an evaluation with a truly held-out test set unless we commit to holding back a test split that isn't seen by any process until the very end of the evaluation. That's what's actually done in practice: there are three data splits, training for model building, validation for model selection, and a test set for the final evaluation. The training and test sets are typically split out first, and then cross-validation is run on the training data to do model and parameter selection. Again, the test set is not seen until the very end of the evaluation process. Machine learning researchers take this protocol very seriously; the train-validate-test design is an important, universally applied framework for effective evaluation of machine learning models.
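Here's a minimal sketch of what that three-way protocol can look like in code, again assuming the binary digits task; the gamma grid is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, (dataset.target == 1).astype(int)

# 1. Split out the final test set first; it is not touched again until the end.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Model and parameter selection using cross-validation on the training data only.
#    GridSearchCV handles the internal train/validation splits for us.
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'gamma': [0.001, 0.01, 0.1, 1, 10]},  # illustrative grid
                    scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print('Selected gamma:', grid.best_params_)

# 3. A single final evaluation of the selected model on the held-out test set.
print('Held-out test AUC:', grid.score(X_test, y_test))
```

Here GridSearchCV's internal cross-validation plays the role of the validation split, so only the test set remains strictly untouched until the final score.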
That brings us to the end of this section of the course on evaluation for machine learning. You should now understand why accuracy gives only a partial picture of a classifier's performance, and be more familiar with the motivation for and definition of important alternative evaluation methods and metrics in machine learning, like confusion matrices, precision and recall, the F1 score, and area under the ROC curve. You've also seen how to apply and choose among these evaluation metrics in model selection and parameter tuning, so that a classifier is optimized for a given evaluation metric.

Finally, I'd like to leave you with a couple of points. First, simple accuracy may often not be the right goal for your particular machine learning application. As we saw, for example, with tumor detection or credit card fraud, false positives and false negatives can have very different real-world consequences for users or for organizational outcomes, so it's important to select an evaluation metric that reflects those user, application, or business needs. Second, there are a number of other dimensions along which it may be important to evaluate your machine learning algorithms that we don't cover here, but that you should be aware of. I'll mention two specifically. Learning curves are used to assess how a machine learning algorithm's evaluation metric changes or improves as the algorithm gets more training data. Learning curves may be useful as part of a cost-benefit analysis: gathering training data in the form of labeled examples is often time-consuming and expensive, so being able to estimate the likely performance improvement of your classifier if you, say, invest in doubling the amount of training data can be a useful analysis. The second dimension, sensitivity analysis, amounts to looking at how an evaluation metric changes as small adjustments are made to important model parameters. This helps assess how robust the model is to the choice of parameters, which may be especially important when other costs, such as runtime efficiency, are critical variables in deploying an operational system and are correlated with different parameter values, for example decision tree depth or a feature-value threshold. In this way, a more complete picture of the tradeoffs achievable across different performance dimensions can help you make the best practical deployment decisions for your machine learning model.
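If you'd like to explore these two ideas further, here's a minimal sketch using scikit-learn's learning_curve and validation_curve utilities; the classifier, parameter range, and scoring choices are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, (dataset.target == 1).astype(int)

# Learning curve: how does cross-validated AUC improve as training size grows?
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='rbf', gamma=0.001), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='roc_auc')
print('Training sizes:', train_sizes)
print('Mean validation AUC:', valid_scores.mean(axis=1))

# Sensitivity analysis: how does AUC change as gamma is varied around its chosen value?
gamma_range = np.logspace(-4, -1, 7)
train_scores, valid_scores = validation_curve(
    SVC(kernel='rbf'), X, y,
    param_name='gamma', param_range=gamma_range, cv=5, scoring='roc_auc')
print('Gamma values:', gamma_range)
print('Mean validation AUC:', valid_scores.mean(axis=1))
```

Plotting the mean validation scores against training size or against gamma gives the learning curve and sensitivity picture described above.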