Welcome to the second week of this course. This week, we will derive the Kalman filter algorithm, which is a special case of something known as Gaussian sequential probabilistic inference. Last week, we looked at a lot of probability theory. This week, there's going to be more linear algebra involved, and I personally think that's a little more straightforward; I think you will agree as we proceed through this week and the following weeks. There are a lot of linear-algebra steps that we need to develop, but I will pay special attention to taking small steps so that you can follow along with every part of the development.

We begin by defining some mathematical notation that I will use from this point onward. Whenever you see a variable in an equation that has a superscript minus sign, that tells you that the variable is a predicted quantity: it has been computed using values of past measurements only, not present measurements. Whenever you see a variable that has a superscript plus sign, that tells you it is an estimated quantity: it has been computed using values of both past and present measurements. Whenever you see a variable that has a caret, or "hat," on top of it, that indicates that the variable is either a predicted or an estimated quantity. It is some guess of the value of that variable, depending either strictly on past information or on past and present information together. So, if you see x hat plus, that tells you the variable x is an estimate: the hat says it's either an estimate or a prediction, and the superscript plus tells you it's an estimate. If you see x hat minus, that tells you the variable x is a prediction: the hat tells you it's either an estimate or a prediction, and the minus tells you it's a prediction. Whenever you see a mathematical symbol that has a tilde on top of it, the wavy line, that indicates that the variable is a difference in some way, and almost always an error between a true value and an estimated or predicted value. So, for example, x tilde is always equal to the true value of x minus either an estimate of x or a prediction of x.

We will be working with vector variables and their estimates, so we will also be working with the autocorrelation and autocovariance matrices of estimation errors. The capital symbol Sigma is used to denote either a correlation or a covariance, and here I share with you first that Sigma denotes a correlation between the two arguments in its subscript. So, on the left, Sigma xy is the correlation between random variable x and random variable y, and if there's only one argument in the subscript, it's an autocorrelation: on the right, Sigma x is the correlation between x and itself. If the arguments in the subscript of Sigma are zero-mean, as they often are for the variables we will work with, then Sigma represents a covariance. So, suppose that x minus x hat, which is what we call x tilde, is zero-mean, and that y tilde is also zero-mean; then the correlation between x tilde and y tilde is the same as the covariance between x tilde and y tilde, because both quantities are zero-mean. Sigma can also represent a covariance matrix in that case. With that background, we can start deriving the steps of Gaussian sequential probabilistic inference.
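In symbols, a compact summary of this notation, as described above:

```latex
% Notation summary (reconstructed from the narration)
\begin{aligned}
\hat{x}_k^{-} &: \text{prediction of } x_k \text{ (uses measurements } y_0,\dots,y_{k-1} \text{ only)}\\
\hat{x}_k^{+} &: \text{estimate of } x_k \text{ (uses measurements } y_0,\dots,y_k)\\
\tilde{x} &= x - \hat{x} \quad \text{(error: truth minus estimate or prediction)}\\
\Sigma_{\tilde{x}\tilde{y}} &= \mathbb{E}\!\left[\tilde{x}\,\tilde{y}^{\mathsf{T}}\right] \quad \text{(a correlation; also a covariance when } \tilde{x},\tilde{y} \text{ are zero-mean)}
\end{aligned}
```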
So, our objective is to compute a state estimate that is close to the true state of the system in some way. Another way of thinking about this is that we want to make the state-estimation error small: the difference between the true state and the estimate, we want to make that small in some sense. If we're talking about a scalar, we can use the absolute value to describe an error being small, but when we're talking about a vector, we have to think about it a little differently. What we're going to do is think about the length of the error vector between the true state and the estimated state, and actually we're going to think about the length squared, because the length squared turns out to result in simpler mathematics. So, our objective is to minimize the square of the length of the error vector.

In mathematical notation, we are trying to find the value x hat such that the expected value of the square of the distance between x and x hat is minimized. We also want to take into consideration any side information that we have, and so we're going to use whatever measurements we have made up to and including this point in time during the estimation process. So, we are truly trying to find the value of x hat that minimizes this expected squared length of an error vector, conditioned on the knowledge of all measured values from our system, and that's what the first line of the equation is saying. "Arg min" says: return the argument that minimizes something. So, return the x hat that minimizes everything inside the parentheses. What am I trying to minimize? I'm trying to minimize the length squared of x minus x hat, given the entire set of measurements fancy Y k, which you remember is the set of y0, y1, y2, all the way up to the present measurement of this system. The second line of the equation expands the definition of the length squared in terms of more familiar operations: it's x minus x hat, transpose, times x minus x hat. Then, the third line of the equation multiplies out this quadratic form: it's x transpose times x, minus x transpose times x hat, minus x hat transpose times x, plus x hat transpose times x hat. There are four terms overall that are multiplied out and combined in that third line.

So, I need to minimize everything inside the parentheses; I need to minimize that expected value. From calculus, you know that we can often find the minimum of a function, or the maximum of a function for that matter, by taking the derivative of that function and setting the derivative equal to zero, and that's exactly what we're going to do. We're going to take the derivative of this expected value, set the result to zero, and find out what we get when we solve that expression. One problem that we run into with this approach is that we need to take the derivative of a function that involves vector quantities. Taking derivatives of scalar equations to find minima and maxima is something you've probably done many times before, but you may never have done it for a vector-valued equation, and that's okay. In order to solve this problem, you need to know only the three identities that are on this slide, and in fact, I think we use only two of them. To apply these identities, I first rewrite the equation that we desire to minimize from the previous slide as the first line of the equation that you see here, and we take advantage of the fact that the derivative is a linear operator.
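Written out, the minimization problem and the standard vector-derivative identities referenced here look like this; I've written the identities in the column (denominator-layout) convention, which is an assumption about how the slide presents them:

```latex
% The minimization problem and the three vector-derivative identities
\begin{aligned}
\hat{x}_k^{+} &= \arg\min_{\hat{x}}\; \mathbb{E}\!\left[\|x_k-\hat{x}\|^2 \,\middle|\, \mathbb{Y}_k\right]\\
&= \arg\min_{\hat{x}}\; \mathbb{E}\!\left[(x_k-\hat{x})^{\mathsf{T}}(x_k-\hat{x}) \,\middle|\, \mathbb{Y}_k\right]\\
&= \arg\min_{\hat{x}}\; \mathbb{E}\!\left[x_k^{\mathsf{T}}x_k - 2x_k^{\mathsf{T}}\hat{x} + \hat{x}^{\mathsf{T}}\hat{x} \,\middle|\, \mathbb{Y}_k\right],\\[4pt]
\frac{\partial\, Y^{\mathsf{T}}x}{\partial x} &= Y, \qquad
\frac{\partial\, x^{\mathsf{T}}Y}{\partial x} = Y, \qquad
\frac{\partial\, x^{\mathsf{T}}Ax}{\partial x} = \left(A + A^{\mathsf{T}}\right)x .
\end{aligned}
```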
The expectation is also a linear operation, which means that the expectation of a sum of things is equal to the sum of the expected values of those same things. I can move a summation into and out of an expectation; I can move a linear operation into and out of an expectation, and we do that quite frequently. So here, because the derivative operation is a linear operation, I can say that the derivative of an expectation is equal to the expectation of a derivative: I can move the derivative inside. When we do that, we need to compute the derivative of x transpose x with respect to x hat. Notice that this must be zero, because x transpose x is not a function of x hat, so that first term goes away. Then, the second term is minus two times x transpose times x hat, differentiated with respect to x hat, and that's the first identity on the slide, with Y transpose equal to minus two x transpose in this case. So, we compute that result. Then, the final thing we need the derivative of is x hat transpose times x hat (there should be a transpose there), and that's the third identity at the top, if we let the A matrix be the identity. So, the answer to that derivative is two times the identity times x hat.

So, bottom line: we use this dictionary of three identities, we apply it to this differentiation problem, and we come up with an expression, which is the first expression on the second line. Then, we recognize that the expectation of this summation is equal to the summation of expectations because of linearity; furthermore, the expected value of a constant is itself. So I write this as follows: I've got two times x hat inside of the expectation given Y, and since x hat is a constant, that's just positive two times x hat. Then, I also have the expected value of x given Y multiplied by negative two, and I can't simplify that yet. So, the second part of the second line is a simplification of the first part, recognizing that expectation is linear. Then, I set that equal to zero and move the two components to the two different sides of the equation. I write, then, that the optimal state estimate of my system right now, x hat k plus, is equal to the expected value of x k given all measurements up to and including time k.

So, that's pretty interesting. We now have this equation for the optimal state estimate, and it might seem reasonable: the best estimate I can give you is what I expect the state to be, given all the measurements that I've made right up until now. The problem is that I can't implement it. I can write it mathematically, and it's very simple looking, but I don't know how to compute this expectation inside of a programming language. Now, you might ask why I can't go back to the definition of an expectation using integrals and compute the conditional expectation by approximating those integrals numerically. That's essentially what a particle filter tries to do; not exactly, but that's basically the idea of a particle filter. It involves approximating integrals at every time step, and that's very, very computationally intensive. So, we're going to look for a different way of doing it. We desire to come up with a set of steps that can be implemented in computer code that somehow computes this conditional expectation. When we are successful in doing so, we will have derived the Gaussian sequential probabilistic inference solution. It's going to take some time to define this algorithm.
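Putting those steps together, the derivation just described reads as follows:

```latex
% Setting the derivative of the cost to zero
\begin{aligned}
\frac{\partial}{\partial \hat{x}}\,\mathbb{E}\!\left[x^{\mathsf{T}}x - 2x^{\mathsf{T}}\hat{x} + \hat{x}^{\mathsf{T}}\hat{x} \,\middle|\, \mathbb{Y}_k\right]
&= \mathbb{E}\!\left[-2x + 2\hat{x} \,\middle|\, \mathbb{Y}_k\right]\\
&= -2\,\mathbb{E}\!\left[x \,\middle|\, \mathbb{Y}_k\right] + 2\hat{x} \;=\; 0\\
\Longrightarrow\quad \hat{x}_k^{+} &= \mathbb{E}\!\left[x_k \,\middle|\, \mathbb{Y}_k\right].
\end{aligned}
```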
We will need to define some new notation and think about some concepts in conditional probability to get to the end of the solution, but before too long, we will achieve our goal. We start by defining a prediction error, x tilde minus, which is equal to the true value of the state minus the prediction of the state, where we define the state prediction as the expected value of the state given all measurements up to the previous point in time only. It may be helpful for you to remember that we always, always, always define error as the difference between truth and prediction, or truth and estimate; it's truth minus something in every case. It's not possible to actually compute this error in practice, since the true value is not known. If I knew the true value, I would not need to estimate it, so I would not need to go through this entire process. But it is possible to come up with mathematics and statistics regarding this estimation error that are super helpful for us. We can prove some very important statistical results using this definition of prediction error, and ultimately, we're going to find an algorithm for estimating the true value of the state using only measurable values.

We also define something called an innovation, and I will use that word very, very frequently from now on, so please remember it. The innovation is the difference between the true system output and my prediction of the system output, and we give it the notation y tilde. In some way, this innovation is what is new, or what is unexpected, in the measurement that I make. I thought I was going to make the measurement y hat, but I didn't; I measured something else. So, y tilde is the surprise factor, the newness, the new information contained in the measurement, and that's why it's called the innovation. Here, y hat is the expected value of the present measurement given all prior measurements, and we'll be using that a lot too.

It's possible to show that the prediction error and the innovation are both zero-mean. We're going to do so using a method known as the method of iterated expectation. Iterated expectation of some variable x first finds the expected value of x given y, which is itself a random variable that depends on y; that's the output of the first conditioning. Then, I take that output and find its expected value, which turns out to be equal to the expected value of x. I've tried to write it carefully in the first equation in the bullet point: the inner expectation is a conditional expectation taken with respect to the conditional PDF of x given y, and the outer expectation uses the PDF of y, which is why I have a subscript y there.

So, let's try to find the expected value of this state prediction error. By definition, this is the expected value of truth minus prediction, and the prediction we've defined to be the expected value of the state given all previous measurements. Notice that this second expectation is in the iterated-expectation form. Because of that, the second expectation is simply the expected value of x. So, the expected value of the prediction error is equal to the expected value of x minus the expected value of x, which is zero. We can use the same kind of logic to find the expected value of the measurement prediction error, the expected value of y tilde, and show that its value is also zero.
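In equation form, the definitions and the zero-mean results just described are:

```latex
% Definitions, iterated expectation, and the zero-mean properties
\begin{aligned}
\tilde{x}_k^{-} &= x_k - \hat{x}_k^{-}, \quad \hat{x}_k^{-} = \mathbb{E}\!\left[x_k \mid \mathbb{Y}_{k-1}\right];
\qquad
\tilde{y}_k = y_k - \hat{y}_k, \quad \hat{y}_k = \mathbb{E}\!\left[y_k \mid \mathbb{Y}_{k-1}\right];\\
\mathbb{E}_{y}\!\left[\,\mathbb{E}_{x\mid y}\!\left[x \mid y\right]\right] &= \mathbb{E}[x]
\qquad \text{(iterated expectation)};\\
\mathbb{E}\!\left[\tilde{x}_k^{-}\right]
&= \mathbb{E}\!\left[x_k\right] - \mathbb{E}\!\left[\,\mathbb{E}\!\left[x_k \mid \mathbb{Y}_{k-1}\right]\right]
= \mathbb{E}[x_k] - \mathbb{E}[x_k] = 0,
\qquad
\mathbb{E}\!\left[\tilde{y}_k\right] = 0 \;\text{ similarly}.
\end{aligned}
```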
So, our state prediction errors will have zero mean, and our measurement prediction errors will have zero mean also. Using similar logic, we can show that the state prediction error is uncorrelated with all past measurements. The reason is that we have taken all of the information from the past measurements that was useful to us and somehow incorporated it into the prediction, to get the best prediction that we could. So, whatever is left over from the past measurements is not useful, because it's uncorrelated with what we desire to predict. The expected value of the prediction error given past measurements can be expanded, as shown in this equation, as the expected value of the state given past measurements, minus the expected value of the state given past measurements, all conditioned on past measurements. The subtraction is zero, which, as you saw earlier on the slide, is the same value as the unconditional expected prediction error. So keep those results handy; we'll use them.

With this information, we can begin to derive the solution that we are looking for. So, consider, if you will, the expected value of the prediction error given all measurements, not only the previous measurements. In the equation on the slide, we first write the expected value of the prediction error given all measurements up to and including the present point in time. Then, we recognize that the prediction error is equal to the true state minus the prediction. Then we recognize that expectation is linear, so the expectation of a summation can be written as the summation of the expectations. So, I write this as the expected value of the true state given all measurements, minus the expected value of the prediction given all measurements. The first term on the right side is what we recognize as the estimate of the state, or x hat plus; the second is a little more complicated. Inside of the expectation, we have the state prediction. You know that the state prediction is itself equal to the expected value of the state given all prior measurements. So, once I've computed x hat minus, it's no longer an expectation; it's now a value, a constant, and the expected value of a constant is itself. So, the expected value of x hat k minus given fancy Y k is just the expected value of a constant given fancy Y k, which is equal to that constant, x hat k minus. So, this first way of writing what we're looking for on this slide says that the expected prediction error given all measurements is equal to the estimate of the state minus the prediction of the state.

Now, we look at expressing the same relationship in a different way. The expected prediction error given all measurements can be written as the expected prediction error given all past measurements as well as the present measurement. So, I'm splitting fancy Y sub k, which is the set of all measurements up to and including time k, into fancy Y k minus one, which is all measurements up to and including time k minus one only, augmented with the additional measurement y k. We know that this must be equal to the expected value of the prediction error given only the most recent measurement, because we've said that the prediction error is uncorrelated with all prior measurements; all the information from the prior measurements has already been taken into consideration. So now, I've got two ways of writing the relationship at the top of the slide, and they're equal to each other, so I can say that x hat plus minus x hat minus is equal to the expected value of x tilde k minus given y k.
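Written out, the two equivalent expressions and the conclusion they lead to are:

```latex
% Two ways of writing E[x_tilde_k^- | Y_k], and the resulting update equation
\begin{aligned}
\mathbb{E}\!\left[\tilde{x}_k^{-} \mid \mathbb{Y}_k\right]
&= \mathbb{E}\!\left[x_k \mid \mathbb{Y}_k\right] - \mathbb{E}\!\left[\hat{x}_k^{-} \mid \mathbb{Y}_k\right]
= \hat{x}_k^{+} - \hat{x}_k^{-},\\
\mathbb{E}\!\left[\tilde{x}_k^{-} \mid \mathbb{Y}_k\right]
&= \mathbb{E}\!\left[\tilde{x}_k^{-} \mid \mathbb{Y}_{k-1},\, y_k\right]
= \mathbb{E}\!\left[\tilde{x}_k^{-} \mid y_k\right]\\
\Longrightarrow\quad \hat{x}_k^{+} &= \hat{x}_k^{-} + \mathbb{E}\!\left[\tilde{x}_k^{-} \mid y_k\right].
\end{aligned}
```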
Or, rearranging, I can say that the estimate is equal to the prediction plus an update, where the update is the expected value of the prediction error given the present measurement. So, Gaussian sequential probabilistic inference is trying to find the best estimate of the state; it's trying to find x hat k plus. Here, we've shown that x hat k plus can be written in terms of x hat k minus, our prediction, plus a correction factor, where the correction factor is the expected value of the prediction error given the present measurement. So, this is a sequence of steps that involves a prediction first, finding x hat k minus, and then correcting that prediction with an update, as shown. So, great: this is super important. We still don't know how to do it, but at this point, you can see the expression that we will be implementing inside of our state-estimation algorithms. We will first predict the state, and then we'll correct that prediction.

In order to learn this, you first had to learn the meaning of the notation that we're going to use from now on. You learned that a superscript minus sign refers to a predicted variable, and a superscript plus sign refers to an estimated variable. When I use a hat over a symbol, that means that I'm either predicting or estimating something. When I use a tilde, that means that I have either a prediction error or an estimation error. You also learned that the symbol Sigma refers to a correlation matrix, and when its subscripts have tildes on them, which are zero-mean quantities, it's a covariance matrix as well.

In order to estimate the state of our battery-cell model, we are going to solve a minimum mean squared error problem. This attempts to minimize the squared length of the error between the true state and the estimated state. When seeking to solve this minimization, our initial solution was that the estimate is equal to the expected value of the present state given all measurements made from this system up to and including the present point in time. We can't implement that yet, so we had to do some more analysis, and when we did, we found that we could express this estimate as being equal to a predicted state plus a correction factor. So, this is a predict-correct structure: first, we predict the state, then we correct the prediction based on the present measurement. We still don't know how to compute the state prediction, and we also don't know how to compute the correction factor, so the method isn't complete. We have a framework that we have to develop a little bit further before we have a complete solution, but we are making really important progress. We need to figure out how to compute the prediction first, and then we need to know how to compute the correction factor, and actually we're going to do that in the reverse order. So, in the next lesson, you're going to learn how to compute this correction factor.
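To make the predict-correct structure concrete, here is a minimal numeric sketch for a scalar toy system. The random-walk model, the noise variances, and the gain L are illustrative assumptions only, previewing the kind of correction that the coming lessons derive; it is not the general method itself.

```python
import numpy as np

# Minimal sketch of the predict/correct structure for a scalar toy system.
# Assumptions (not from this lesson): random-walk state x[k] = x[k-1] + w[k],
# measurement y[k] = x[k] + v[k], and a gain L standing in for the
# correction factor E[x_tilde_minus | y[k]] derived in later lessons.

rng = np.random.default_rng(0)
sigma_w2, sigma_v2 = 0.1, 1.0            # process- and sensor-noise variances

x_true = 0.0                              # true (normally unknown) state
x_hat, sigma2 = 0.0, 1.0                  # estimate and its error variance
for k in range(100):
    # Simulate the true system and take a measurement
    x_true += rng.normal(scale=np.sqrt(sigma_w2))
    y = x_true + rng.normal(scale=np.sqrt(sigma_v2))

    # Predict: x_hat_minus plays the role of E[x[k] | Y[k-1]]
    x_hat_minus = x_hat                   # random walk: prediction = prior estimate
    sigma2_minus = sigma2 + sigma_w2      # prediction-error variance grows

    # Correct: estimate = prediction + update based on the present measurement
    y_hat = x_hat_minus                   # predicted measurement
    innovation = y - y_hat                # the "surprise" in the measurement
    L = sigma2_minus / (sigma2_minus + sigma_v2)
    x_hat = x_hat_minus + L * innovation  # x_hat_plus = x_hat_minus + correction
    sigma2 = (1.0 - L) * sigma2_minus     # estimation-error variance shrinks

print(f"final estimate {x_hat:+.3f} vs truth {x_true:+.3f}")
```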