So, why does it sometimes seem that even supervised machine learning models are so prone to solving the wrong problem? Well, for one, by design they develop their own ways of solving a problem rather than relying on step-by-step instructions from a programmer; remember that we discussed that in the initial introduction. And secondly, and most importantly, they have no contextual knowledge to understand the context of the possible solutions. Remember that there is no common sense without context. And it's important to emphasize that even, or especially, when a health care model seems to be working well by all the metrics and benchmarks we discussed in the prior module, we still need to carefully consider and dissect possible spurious correlations to prevent patient harm. This is the reason why multidisciplinary teams with domain experts are absolutely critical to machine learning in health care model development, evaluation, and deployment, which is something we'll discuss more.

The issues we talked about with correlations in data sets that can be learned by algorithms are at the core of an ongoing tension in machine learning and health care between what you'll hear referred to as black box versus interpretable model algorithms. Many machine learning models for health care applications have been shown to be capable of reliably detecting atrial fibrillation from electrocardiography, retinopathy from optical coherence tomography, and in-hospital mortality risk from electronic health records, and even handling more abstract clinical situations, such as predicting the onset of sepsis before clinician recognition. The machine learning models that have achieved this high performance are based on deep learning, and they rely on vast amounts of data and complex model structures with millions of parameters. For example, the Inception-v3 model from Google, which is more accurate than physicians at identifying diabetic retinopathy from fundus photographs, has 23 million parameters. This complexity can make it difficult to understand exactly how these models make a given decision or prediction, and it has earned them the name black boxes.

In contrast, models with fewer features are easier to visualize, understand, and explain, as we've seen with regression or decision trees, for example. So there is a trade-off: improving interpretability often means reducing the number of predictor variables, but that may reduce the performance of the model (the short sketch after this passage illustrates the contrast). We can expect that deep neural networks will be the least explainable, whereas linear regression and decision trees will remain the most explainable.

Because deep neural networks have been so high-performing, yet exhibit this weakness in terms of interpretability, there are many ongoing efforts to bring more interpretability to these networks. Some strategies include leveraging multidisciplinary teams to review false positive and false negative cases predicted by the model, and testing the model on external datasets to gain insight into causal versus correlative features learned by the model. Other strategies focus on developing computational methods to interpret neural network predictions. One example is looking at what part of an input, for example an image, mattered most to a model in making its prediction. This is referred to as the saliency of different parts of the input for the model.
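Before we turn to saliency maps for images, here is a minimal sketch of the interpretability contrast just described. It uses scikit-learn and a synthetic dataset purely as stand-ins (neither is part of this lecture): the linear model exposes one readable coefficient per input feature, while even a small neural network already has thousands of weights with no individual clinical meaning.

```python
# Minimal sketch (illustrative only): an intrinsically interpretable model
# versus a many-parameter "black box" trained on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a tabular clinical dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X, y)
print("per-feature coefficients:", linear.coef_[0])    # one readable weight per feature

mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0).fit(X, y)
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
print("neural network parameter count:", n_params)     # no single weight maps to one feature
```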
For images, we can visualize what are commonly called saliency maps, which are heat maps of saliency over an image. There are different ways to produce saliency maps. One approach, called class activation maps, or CAM, is based on analyzing the neurons in the final layer of some types of neural networks to compute how much the neurons that are important for a particular class are firing at every spatial location in an image. Mathematically, you can think of producing heat maps showing the neurons' firing levels, in other words their outputs or activation levels, at every spatial location, for any given neuron. Then, to visualize the relative importance of spatial locations for predicting a particular class, we can plot a weighted sum of the neuron firing heat maps we just described, where each heat map is weighted by the importance of that neuron for predicting the class of interest, which is just a parameter value in the final layer of the neural network. Intuitively, this weighted sum of heat maps shows the spatial regions that most strongly trigger the firing of the neurons important for predicting the class of interest.

An alternative approach, which also gives a notion of saliency, is to compute the change in prediction score that would result from a small change in the pixel value at a particular location of the input. Input locations where a change would greatly affect the prediction score can be interpreted as salient for the model. Mathematically, this is easy to compute, since it is just the gradient of the classifier score with respect to the pixel values, and we can plot a spatial heat map of these gradient-based saliencies as well. A short sketch of both approaches follows at the end of this passage.

At this point, I also want to mention a bit of terminology, which is that you'll often hear various words being used that are related to the concept of interpretability, such as transparency, explainability, and inspectability. These terms mean essentially what they sound like. A model that is transparent should allow us to easily understand how it works. A model that is explainable should be able to easily communicate why any particular output is produced. And a model that is inspectable should allow us to probe and inspect the functioning of any part of the model. But the terms are often used in overlapping and overloaded ways, and many people use them more generally to get at the overarching idea of model interpretability. In other words, they're all ways of getting at the notion of opening up the black box of complex models such as deep neural networks.

So, let's talk a little more specifically about what the deal with the black box is anyway. It's become quite common these days to hear people refer to machine learning systems as black boxes. The black box metaphor refers to a system for which we can only observe the inputs and the outputs, but not the internal workings. For as long as there has been a machine learning field, there have been discussions and debates on the topic of interpretability, and referential metaphors about the black box. In health care, there remain concerns about black box models, even when they've been properly vetted and can reliably achieve high performance. This argument makes sense on its face. After all, we've seen with some of the earlier examples how badly things can go wrong when models learn spurious correlations in the data.
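To make the two saliency approaches described above concrete, here is a minimal sketch in PyTorch. The specifics are illustrative assumptions rather than anything from the lecture: a pretrained ResNet-18 stands in for the classifier (its global-average-pooling-plus-linear-layer ending is the structure that CAM relies on), a random tensor stands in for a preprocessed image, and the forward hook on `layer4` is just one way to capture the final convolutional activation maps.

```python
# Minimal sketch (illustrative assumptions throughout; see the note above).
import torch
import torch.nn.functional as F
from torchvision import models

# Pretrained ResNet-18 as a stand-in classifier (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Capture the activation maps ("neuron firing levels") of the last conv block.
feature_maps = {}
model.layer4.register_forward_hook(
    lambda module, inputs, output: feature_maps.update(last_conv=output)
)

image = torch.rand(1, 3, 224, 224)          # stand-in for a preprocessed image
logits = model(image)
class_idx = logits.argmax(dim=1).item()     # class we want to explain

# 1. Class activation map: weighted sum of the per-neuron activation maps, each
#    weighted by that class's parameter for the neuron in the final linear layer.
class_weights = model.fc.weight[class_idx]                   # (K,) neuron importances
activations = feature_maps["last_conv"][0]                   # (K, H, W) firing maps
cam = torch.einsum("k,khw->hw", class_weights, activations)  # weighted sum over neurons
cam = F.relu(cam)                                            # keep positive contributions
cam = F.interpolate(cam[None, None], size=image.shape[-2:], mode="bilinear")[0, 0]

# 2. Gradient-based saliency: how much the class score changes for a small
#    change in each pixel, i.e. the gradient of the score w.r.t. the input.
image.requires_grad_(True)
score = model(image)[0, class_idx]
score.backward()
saliency = image.grad.abs().max(dim=1)[0]                    # (1, H, W) pixel heat map
```

Both results are spatial heat maps that can be overlaid on the original image for review; in practice you would of course feed in a real, properly preprocessed image rather than random values.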
But even when we can properly vet a model to ensure that these spurious correlations don't exist, it's unclear whether complete explainability is actually necessary. As Geoff Hinton, a leader in AI, was quoted on the subject, clinicians and regulators should not insist on explainability, because people can't explain how they work for most of the things they do. Increasing model transparency can also have the opposite effect and may contribute to dangerous levels of trust in the model; it can lead to over-confidence or over-reliance on the model and to human error, and we'll talk more about this later. After all, many clinicians will tell you that they don't always know how or why they arrived at a particular diagnosis. Further, many common medications, like aspirin and penicillin, were used in patients for decades before medical science understood how they worked. So when you think about the black box and model transparency, the question is whether it makes sense to hold AI models to a higher standard of interpretability than routine clinical practice.

However you land on this question, there's still a lot of work in the area of model interpretability and in answering the question of how to unlock the black box. The first step is to consider two distinct flavors of machine learning model explainability: intrinsic and post-hoc. Intrinsic interpretability simply refers to models, often simple models, that are self-explanatory from the start. In contrast, post-hoc interpretability is used to understand decisions made by complex models that do not have prescriptive, declarative knowledge representations or features.
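As one concrete example of a post-hoc approach, here is a minimal sketch using permutation importance, a model-agnostic technique chosen purely for illustration (the lecture does not name a specific method). The gradient-boosted classifier and the synthetic dataset are stand-ins; the point is that the trained model is treated strictly as a black box, and we learn about it only by perturbing its inputs and observing its outputs.

```python
# Minimal sketch of post-hoc, model-agnostic explanation via permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular clinical dataset.
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure how much held-out performance drops;
# the model's internals are never inspected, only its inputs and outputs.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: mean importance {result.importances_mean[idx]:.3f}")
```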