I've said pretty much everything I want to say about sorting at this point, but I do want to cover one more related topic, namely the selection problem. This is the problem of computing order statistics of an array, with computing the median of an array being a special case. Analogous to our coverage of quicksort, the goal is going to be the design and analysis of a super-practical randomized algorithm that solves the problem. And this time we'll even achieve an expected running time that is linear in the length of the input array, that is, big O of n for input arrays of length n, as opposed to the O of n log n time that we had for the expected running time of quicksort. Like quicksort, the mathematical analysis is also going to be quite elegant. In addition, these two required videos on this very practical algorithm will motivate two optional videos that are on very cool topics of a similar, more theoretical nature. The first optional video is going to be on how you solve the selection problem in deterministic linear time, that is, without using randomization. The second optional video will be a sorting lower bound, that is, why no comparison-based sort can have a better running time than merge sort: none can beat big O of n log n. So, a few words about what you should have fresh in your mind before you watch this video. I'm definitely assuming that you've watched the quicksort videos, and not just watched them but that you have that material pretty fresh in your mind. In particular, the quicksort video about the partition subroutine: that's where you take an input array, you choose a pivot, and you do repeated swaps, rearranging the array so that everything less than the pivot is to the left of it and everything bigger than the pivot is to the right of it. You should remember that subroutine, and you should also remember the previous discussion about pivot choices, the idea that the quality of a pivot depends on how balanced a split into two subproblems it gives you. Those are both going to be important. For the analysis of this randomized linear-time selection algorithm, I need you to remember the concepts from probability review part one, in particular random variables, their expectations, and linearity of expectation. That said, let's move on and formally define what the selection problem is. The input is the same as for the sorting problem: you're given an array of n distinct entries. But in addition, you're told what order statistic you're looking for. That's going to be a number i, an integer between 1 and n. The goal is to output just a single number, namely the ith order statistic, that is, the ith smallest entry in this input array. Just to be clear, if you had an array of, let's say, 4 elements, containing the numbers 10, 8, 2, and 4, and you were looking for, say, the 3rd order statistic, that would be this 8. The first order statistic is just the minimum element of the array; that's easy to find with a linear scan. The nth order statistic is just the maximum; again, easy to find with a linear scan. The middle element is the median, and you should think of that as the canonical version of the selection problem. Now when n is odd, it's obvious what the median is: it's just the middle element, the (n+1)/2-th order statistic. If the array has even length, there are two possible medians, so let's just take the smaller of them; that's the (n/2)-th order statistic.
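To pin down those conventions, here is a tiny illustrative snippet; it just restates the definitions from above, it is not an algorithm for the selection problem.

# Median convention from the lecture: the ((n+1)/2)-th order statistic for odd n,
# and the (n/2)-th (the smaller of the two middle elements) for even n.
def median_order_statistic_index(n):
    return (n + 1) // 2 if n % 2 == 1 else n // 2

assert median_order_statistic_index(5) == 3
assert median_order_statistic_index(4) == 2   # e.g. for [10, 8, 2, 4], the median is the 2nd smallest, namely 4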
You might wonder why you'd ever want to compute the median of an array rather than the mean, that is, the average. It's easy to see that you can compute the average with a simple linear scan; the median is not as straightforward. One motivation for the median is that it's a more robust version of the mean: if you have a data entry problem and it corrupts one element of an input array, it can totally screw up the average value of the array, but it generally has very little impact on the median. A final comment about the problem is that I'm going to assume that the array entries are distinct, that is, there are no repeated elements. But just like in our discussions of sorting, this is not a big assumption, and I encourage you to think about how to adapt these algorithms to work even if the arrays do have duplicates. You can indeed still get the same very practical, very fast algorithms with duplicate elements. Now if you think about it, we already have a pretty darn good algorithm that solves the selection problem. Here's the algorithm: it's two simple steps, and it runs in O of n log n time. Step one, sort the input array. We have various subroutines to do that; let's say we pick merge sort. Now, what is it we're trying to do? We're trying to find the ith smallest element of the input array. Well, once we've sorted it, we certainly know where the ith smallest element is: it's in the ith position of the sorted array. So that's pretty cool. We've just done what a computer scientist would call a reduction, and that's a super useful and super fundamental concept. It's when you realize that you can solve one problem by reducing it to another problem that you already know how to solve. What we just showed is that the selection problem reduces easily to the sorting problem. We already know how to solve the sorting problem in n log n time, so that gives an n log n time solution to the selection problem. But remember the mantra of any algorithm designer worth their salt: can we do better? We should avoid complacency. Just because we got n log n doesn't mean we should stop there; maybe we can be even faster. Now, certainly we're going to have to look at all the elements of the input array in the worst case, so you shouldn't expect to do better than linear. But hey, why not linear time? Actually, if you think about it, we probably should have asked that question back when we were studying the sorting problem. Why were we so content with the n log n time bound for merge sort, and the O of n log n time-on-average bound for quicksort? Well, it turns out we have a really good reason to be happy with our n log n upper bounds for the sorting problem. It turns out, and this is not obvious and will be the subject of the optional video, that you actually can't sort an input array of length n better than n log n time, either in the worst case or on average. So, in other words, if we insist on solving the selection problem via a reduction to the sorting problem, then we're stuck with this n log n time bound. Strictly speaking, that's for something called comparison sorts; see the optional video for more details, but the upshot is that it covers any general-purpose sorting algorithm. So if we want to do better than n log n for selection, we have to do it using ingenuity beyond this reduction; we have to prove that selection is a strictly easier problem than sorting. That's the only way we're going to get an algorithm that beats n log n, and that's the only way we can conceivably get a linear-time algorithm. And that is exactly what is up next on our plates.
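Here is a minimal sketch of that two-step reduction, using Python's built-in sort as a stand-in for merge sort (any O(n log n) comparison sort gives the same bound):

def select_by_sorting(a, i):
    # Step 1: sort the array (merge sort, or here Python's built-in sort).
    b = sorted(a)
    # Step 2: the ith smallest element now sits in the ith position of the sorted array.
    return b[i - 1]

# e.g. select_by_sorting([10, 8, 2, 4], 3) returns 8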
We're going to show that selection is indeed fundamentally easier than sorting: we can have a linear-time algorithm for it, even though we can't get a linear-time algorithm for sorting. You can think of the algorithm we're going to discuss as a modification of quicksort, and in the same spirit as quicksort, it will be a randomized algorithm, and the running time will be an expected running time that holds for any input array. Now, for the sorting problem, we know that quicksort runs in n log n time on average, where the average is over the coin flips done by the code. But we also know that if we wanted to, we could get a sorting algorithm that runs in n log n time without using randomization; the merge sort algorithm is one such solution. So here, we're giving a linear-time solution for selection, for finding order statistics, that uses randomization, and it would be natural to wonder: is there an analog of merge sort? Is there an algorithm which does not use randomization and achieves this exact same linear time bound? In fact, there is. The algorithm is a little more complicated, and therefore not quite as practical as this randomized algorithm, but it's still very cool; it's a really fun algorithm to learn and to teach. So I will have an optional video about linear-time selection without randomization. For those of you who aren't going to watch that video but want to know the key idea: the idea is to choose the pivot deterministically, in a very careful way, using a trick called the median of medians. That's all I'm going to say about it for now; you should watch the optional video if you want more details. I do feel compelled to warn you that if you're actually going to implement a selection algorithm, you should use the one we discuss in this video, not the deterministic linear-time one, because the one we'll discuss in this video has both smaller constants and works in place. So what I want to do next is develop the idea that we can modify the quicksort paradigm in order to directly solve the selection problem. To get an idea of how that works, let me review the partition subroutine. Like in quicksort, this subroutine will be our workhorse for the selection algorithm. What the partition subroutine does is take as input some jumbled-up array, and it solves a problem which is much more modest than sorting. In partitioning, it's going to first choose a pivot element somehow; we'll have to discuss what a good strategy for choosing a pivot element is. But suppose in this particular input array it chooses the first element, this 3, as the pivot element. The responsibility of the partition subroutine is then to rearrange the elements of the array so that the following properties are satisfied: anything less than the pivot is to the left of it, and it can be in jumbled order, but if you're less than the pivot you'd better be to the left, like this 2 and 1, which are less than 3. If you're bigger than the pivot, then again you can be in jumbled order amongst those elements, but all of them have to be to the right of the pivot, and that's true for the numbers 4 through 8: they are all to the right of the pivot 3, in jumbled order. This in particular puts the pivot in its rightful position, where it will belong in the final sorted array, and at least for quicksort, it enabled us to recursively sort two smaller subproblems. So this is where I want you to think a little bit about how we should adapt this paradigm.
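Here is a minimal sketch of an in-place partition subroutine in the spirit of the one from the quicksort videos. It assumes the chosen pivot has already been swapped into the first position of the subarray; this is one standard way to write it, not the only one.

def partition(a, l, r):
    # Partitions a[l..r] in place around the pivot a[l]; returns the pivot's final index.
    p = a[l]
    i = l + 1                        # boundary between the "< pivot" region and the rest
    for j in range(l + 1, r + 1):
        if a[j] < p:                 # element belongs in the "< pivot" region: swap it in
            a[i], a[j] = a[j], a[i]
            i += 1
    a[l], a[i - 1] = a[i - 1], a[l]  # move the pivot into its rightful position
    return i - 1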
So, suppose I told you the first step of our selection algorithm is going to be to choose a pivot and partition the array. Now the question is, how are we going to recurse? We need to understand how to find the ith order statistic of the original input array. It suffices to recurse on just one subproblem of smaller size and find a suitable order statistic in it. So how should we do that? Let me ask you that with a very concrete example of which pivot we choose and which order statistic we're looking for, and see what you think. The correct answer to this quiz is the second one: we can get away with recursing just once, and in this particular example we're going to recurse on the right side of the array, and instead of looking for the fifth order statistic like we were originally, we're going to recursively search for the second order statistic. So why is that? Well, first, why do we recurse on the right side of the array? By assumption, we have this array of ten elements, we choose the pivot, we do partitioning, and remember, the pivot winds up in its rightful position; that's what partitioning does. So if the pivot winds up in the third position, we know it's the third smallest element in the array. Now that's not what we were looking for; we were looking for the fifth smallest element of the array, which, of course, is bigger than the third smallest element of the array. So after partitioning, where is the fifth smallest element going to be? It's got to be to the right of this third smallest element, to the right of the pivot. So we know for sure that the fifth order statistic of the original array lies to the right of the pivot; that is guaranteed. So we know to recurse on the right-hand side. Now, what are we looking for? We are no longer looking for the fifth order statistic, the fifth smallest element. Why? Well, we've thrown out both the pivot and everything smaller than it; remember, we're only recursing on the right-hand side. So we've thrown out the pivot, the third smallest element, and everything less than it, the minimum and the second minimum. Having deleted the three smallest elements, and originally looking for the fifth smallest, then among what remains, among what we're recursing on, we're looking for the second smallest element. The selection algorithm in general is just the generalization of this idea to arbitrary arrays and arbitrary situations in which the pivot comes back equal to, less than, or bigger than the element you're looking for. So let me be more precise. I'm going to call this algorithm RSelect, for randomized selection, and according to the problem definition it takes as input, as usual, an array A of some length n, and also the order statistic that we're looking for; we'll call that i, and of course we assume that i is an integer between 1 and n, inclusive. For the base case, if the array has size one, then the only element we could be looking for is the first order statistic, and we just return the sole element of the array. Otherwise, we partition the array around a pivot element. And just like in quicksort, we're going to be very lazy about choosing the pivot: we're going to choose it uniformly at random from the n possibilities and hope things work out. That will be the crux of the analysis, proving that random pivots are good enough sufficiently often. Having chosen the pivot, we now just invoke the standard partitioning subroutine. As usual, that's going to give us the partitioned array.
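To make the index bookkeeping in the quiz example concrete, here is the arithmetic, written out for that hypothetical 10-element array:

# Quiz example: n = 10 elements, we want i = 5, and the random pivot
# lands in position j = 3 after partitioning.
n, i, j = 10, 5, 3
assert i > j                  # so the 5th smallest must lie to the right of the pivot
new_length = n - j            # the right part has 10 - 3 = 7 elements
new_i = i - j                 # we discarded the 3 smallest elements, so look for the 2nd smallest
assert (new_length, new_i) == (7, 2)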
You'll have the pivot element, everything less than the pivot to its left, and everything bigger to its right. As usual, I'll call everything less than the pivot the first part of the partitioned array, and everything bigger the second part. Now we have a couple of cases, depending on whether the pivot is bigger or smaller than the element we're looking for, so I need a little notation to talk about that. Let's let j be the order statistic that the pivot p is. So if p winds up being the third smallest element, like in the quiz, then j is going to be equal to three. Equivalently, we can think of j as the position of the pivot in the partitioned version of the array. Now, there's one case which is very unlikely to occur, but we should include it just for completeness. If we're really lucky, then the random pivot just happens to be the order statistic we were looking for; that's when i equals j. We're looking for the ith smallest element, and if by dumb luck the pivot winds up being the ith smallest element, we're done. We can just return it; we don't have to recurse. Now in general, of course, we don't randomly choose the element we're looking for; we choose something that could be bigger or smaller than it. In the quiz, we chose a pivot that was smaller than what we were looking for; actually, that's the harder case. So let's first start with the case where the pivot winds up being bigger than the element we were looking for. That means that j is bigger than i: we're looking for the ith smallest, and we randomly chose the jth smallest, for j bigger than i. This is the opposite case from the quiz. This is where we know what we're looking for has to be to the left of the pivot: the pivot is the jth smallest, everything less than it is to the left, we're looking for the ith smallest, and i is less than j, so that's got to be on the left. That's where we recurse. Moreover, it's clear we're looking for exactly the same order statistic: if we're looking for the third smallest element, we're only throwing out stuff which is bigger than something even bigger than the third smallest element, so we're still looking for the third smallest of what remains. And naturally, the new array size is j minus 1, because that's what's to the left of the pivot. Finally, the last case is when the random element we chose is less than what we're looking for, and then we're just like the quiz: what we're looking for is bigger than the pivot, so it's got to be on the right-hand side, and we know we've got to recurse on the right-hand side. Now, the right-hand side has n minus j elements: we throw out everything up to and including the pivot, so we throw out j things, and there are n minus j left. All of those j things we threw out are less than what we're looking for, so if we used to be looking for the ith smallest element, now we're looking for the (i minus j)th smallest element. So that is the whole algorithm. That is how we adapt the approach we took to the sorting problem in quicksort to the problem of selection. So, is this algorithm any good? Let's start studying its properties and understand how well it works. Let's begin with correctness. The claim is that no matter how the algorithm's coin flips come up, no matter what random pivots we choose, the algorithm is correct, in the sense that it's guaranteed to output the ith order statistic. The proof is by induction, and it proceeds very similarly to the one for quicksort, so I'm not going to give it here.
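Putting the pieces together, here is a minimal sketch of RSelect in code. It relies on the partition sketch from earlier, uses 1-based order statistics, and is meant to illustrate the case analysis rather than serve as a tuned implementation.

import random

def rselect(a, l, r, i):
    # Returns the ith smallest element of a[l..r] (i is 1-based within that subarray).
    if l == r:                       # base case: one element, necessarily the 1st order statistic
        return a[l]
    k = random.randint(l, r)         # choose a pivot uniformly at random
    a[l], a[k] = a[k], a[l]          # move it to the front, as the partition sketch expects
    p = partition(a, l, r)           # the pivot ends up in its rightful position p
    j = p - l + 1                    # the pivot is the jth smallest of this subarray
    if j == i:                       # lucky case: the pivot is exactly what we're looking for
        return a[p]
    elif j > i:                      # target lies to the left; same order statistic, size j - 1
        return rselect(a, l, p - 1, i)
    else:                            # target lies to the right; we discarded the j smallest elements
        return rselect(a, p + 1, r, i - j)

# e.g. rselect(list(x), 0, len(x) - 1, i) returns the ith smallest element of x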
If you're curious about how these proofs go, there's an optional video about the correctness of quicksort. If you watch that and understand it, it should be clear how to adapt that inductive argument to apply to this RSelect algorithm as well. As usual for divide-and-conquer algorithms, the interesting part is not so much understanding why the algorithm works, but rather understanding how fast it runs. So the big question is: what is the running time of this selection algorithm? To understand this, we have to understand the ramifications of pivot choices on the running time. You've seen the quicksort videos, they're fresh in your mind, so what should be clear is that, just like in quicksort, how fast this algorithm runs is going to depend on how good the pivots are, and a good pivot means a pivot that guarantees a balanced split. The next quiz will make sure you understand this point, and it asks you to think about just how bad the running time of the selection algorithm could be if you get extremely unlucky in your pivot choices. The correct answer to this question is exactly the same as the answer for quicksort: the worst-case running time, if the pivots are chosen in a really unlucky way, is actually quadratic in the array length. Remember, we're shooting for linear time, so this quadratic is a total disaster. How could this happen? Well, suppose you're looking for the median, and suppose you choose the minimum element as the pivot every single time. If that's what happens, if every time you choose the pivot to be the minimum, then just like in quicksort, every time you recurse all you succeed in doing is peeling off a single element from the input array. Now, you're not going to find the median element until you've made roughly n over 2 recursive calls, each on an array that has size at least a constant fraction of the original one. So that's a linear number of recursive calls, each on an array of size at least some constant times n, which gives a total running time that is quadratic overall. Of course, this is an absurdly unlikely event; frankly, your computer is more likely to be struck by a meteor than it is for the pivot to be chosen as the minimum element in every recursive call. But if you really have an absolutely worst-case choice of pivots, it would give this quadratic running time bound. So the upshot is that the running time of this randomized selection algorithm depends on how good our pivots are, and for a worst-case choice of pivots the running time can be as large as n squared. Now, hopefully most of the time we're going to have much better pivots, and the analysis proceeds by making that idea precise. The key to a fast running time is going to be the usual property that we want to see in divide-and-conquer algorithms, namely that every time we recurse, the problem size had better not just be smaller, but smaller by a significant factor. How would that happen in this selection approach based on the partition subroutine? Well, if both of the subproblems are not too big, then we're guaranteed that when we recurse, we make a lot of progress. So let's think about what the best possible pivot would be, in the sense of giving a "balanced" split. Of course, in some sense the best pivot is one that just happens to be the order statistic you're looking for; then you're done in constant time. But that's extremely unlikely, and it's not worth worrying about, so ignore the possibility that we might guess the pivot.
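To spell out that worst-case calculation as a sketch (with c standing for the constant hidden in the linear-time partition step):

T(n) \le T(n-1) + c\,n \le c\,\bigl(n + (n-1) + \cdots + 1\bigr) = c\,\frac{n(n+1)}{2} = O(n^2),

and since roughly n/2 of those recursive calls each operate on at least n/2 elements, the running time in this unlucky scenario is also \Omega(n^2), hence \Theta(n^2) overall.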
What's the best pivot if we want to guarantee an aggressive decrease in the input size for the next recursive call? Well, the best pivot is the one that gives as balanced a split as possible. So what's the pivot that gives us the most balanced split, a 50/50 split? If you think about it, it's exactly the median. Of course, this is not super helpful, because the median might well be what we're looking for in the first place, so this is sort of a circular idea. But for intuition, it's still worth exploring what kind of running time we would get in this best case. After all, if we're not going to get linear time even in this magical best case, we certainly shouldn't expect to get it on average over random choices of the pivots. So what would happen if we actually did luckily choose the median as the pivot every single time? Well, we get a recurrence for the running time the algorithm requires on an array of length n. There's only going to be one recursive call; this is the big difference from quicksort, where we had to recurse on both sides and had two recursive calls. Here, we only have one recursive call, and in the magical case where our pivots are always equal to the median, both subproblem sizes are only half as large as the original one. So when we recurse, it's on a problem whose size is guaranteed to be at most n over 2, and outside the recursive call pretty much all we do is one partitioning invocation, which we know is linear time. So the recurrence we get is: T of n is at most T of n over 2 plus big O of n. This is totally ready to get plugged into the master method; it winds up being case 2 of the master method, and indeed we get exactly what we wanted, linear time. To reiterate, this is not interesting in its own right; it's just for intuition, a sanity check that at least for a best-case choice of pivots we'd get what we want, a linear-time algorithm, and we do. Now, the question is: how well do we do with random pivots? The intuition, the hope, is exactly as it was for quicksort, namely that random pivots are a perfectly good surrogate for the perfect pivot, the median. And having the analysis of quicksort under our belts, where indeed random pivots do approximate very closely the performance you get with best-case pivots, maybe now we have reason to believe this is true. That said, as a mathematical statement it is totally not obvious, and it's going to take a proof; that's the subject of the next video. Let me just be clear about exactly what we're claiming. Here is the running time guarantee that RSelect provides: for an arbitrary input array of length n, the average running time of this randomized selection algorithm is linear, big O of n. Let me reiterate a couple of points I made for the analogous guarantee for the quicksort algorithm. The first is that we're making no assumptions about the data whatsoever; in particular, we're not assuming that the data is random. This guarantee holds no matter what input array you feed into this randomized algorithm, so in that sense, this is a totally general-purpose subroutine. So where, then, does this averaging come from? Where does the expectation come from? The randomness is not in the data; rather, the randomness is in the code, and we put it there ourselves. Now let's proceed to the analysis.
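For reference, here is that best-case recurrence worked out against the master method (a sketch, with a the number of recursive calls, b the factor by which the subproblem shrinks, and d the exponent of the work done outside the recursion):

T(n) \le T\!\left(\tfrac{n}{2}\right) + O(n), \qquad a = 1,\; b = 2,\; d = 1,\; a < b^{d} \;\Longrightarrow\; T(n) = O(n^{d}) = O(n).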