Our focus with this module is to understand how to do less work and still get mostly the same amount of information as if we had done all the work. A bit of educated guessing is required, and some assumptions are used along the way. Now, do you remember the rule that when we are dealing with a system with "k" factors, each at two levels, we will have 2 to the power of "k" experiments? That's a lot of experiments in many cases. We saw in the prior module that when we used the software, we could estimate all those coefficients. The key insight that you will take away from these videos is that we don't have to run all those experiments. We can do fewer, but there's going to be a price to pay; and we're going to figure out what that price is in this video.

Here's an experiment with two factors at two levels, and there are the four parameters that we can estimate: the intercept, the main effect of the first factor, the main effect of the second factor, and the two-factor interaction between them. Here is a system with three factors, and as we can see, we can estimate eight parameters after we have completed the eight experiments. A system with four factors will have a total of 16 experiments in a full factorial, and such a system will have 16 parameters that we can estimate using computer software. You can probably appreciate that this procedure quickly becomes prohibitive for most practical systems. There are many systems with 6, 7, or more factors. We do not want to perform the many experiments required by the full factorial; it would be both time prohibitive and cost prohibitive. This is true even for systems that can be highly automated, e.g. systems based on DNA sequencing, or experiments run entirely in computer simulation.

There is also very little use in estimating all 2 to the power of "k" coefficients; that's many, many coefficients in some experiments. Most of them belong to higher-order interactions, and those coefficients will be so small that they're practically zero. You'll seldom see a three-factor interaction that is actually present in a real system, and four-factor and higher-order interactions almost certainly don't exist in practice. By using some educated guessing, and making reasonable assumptions about our system, we are going to figure out a way to do fewer experiments and still retain the essential information about the important effects in our system. At the core of this approach is an implicit assumption that we can ignore these higher-order coefficients in the model. There are occasions when it is appropriate to do that, and there will be times when our assumptions are faulty. It is critical to understand that there are practical situations where it's quite okay to lose some of the prediction accuracy from the higher-order terms. Those higher-order terms definitely help you fine-tune the predictions, but the cost of obtaining them can be prohibitive. You'll need to decide whether or not it is worth doing that work. And that's the subject of today's video.

Perhaps let me ask you to consider the question this way: if we only had the time and budget to do 4 experiments, which 4 of these original 8 would you do? You might start by considering running only the 4 experiments here at the front, but that won't work so well, because you will only have factor C at its low level. There will be no experiments at the high level for factor C, and so you won't really know what factor C does in the system.
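If it helps to see those 8 candidate runs written out, here is a minimal sketch, assuming you are following along in R as in the earlier videos; the variable names are only illustrative.

```r
# Standard-order table for a 2^3 full factorial: all 8 candidate runs
A <- rep(c(-1, +1), times = 4)            # A alternates every run
B <- rep(c(-1, +1), each = 2, times = 2)  # B alternates every 2 runs
C <- rep(c(-1, +1), each = 4)             # C alternates every 4 runs
design <- data.frame(A, B, C)
print(design)

# The tempting "front face" choice keeps only the runs with C at its low level,
# so those 4 runs can never tell us what factor C does:
front.face <- design[design$C == -1, ]
print(front.face)
```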
So then you might say: "What if I select these two at the front and those two at the back?" Those represent the middle four rows from the standard order table. That's not a bad choice, but it's not the best. Let me show you a better choice, and then I will explain it afterwards. Here is the set of 4 experiments that you should do: either select the 4 with open circles or the 4 with closed circles. Notice the interesting pattern in the cube. It is intentionally selected that way, and let me explain why.

We'll work backwards here. Assume we have completed these 4 experiments, the 4 with open circles, and when we analyze the data we discover from the Pareto plot that factor A is not significant. If A is not significant, it essentially implies that we could have ignored factor A and never really needed to include it in our experiments. Another way of saying that is that factor A could have been at the minus level or at the plus level, and it really wouldn't have affected our outcome variable much. If A can exist at two levels and not really affect our outcome, that means we can collapse the minus and the plus layers together. And notice then what happens. As we do that, we recover 4 experiments in factors B and C. Four experiments in two factors; that's a full factorial! We don't have to do any more work here. These four experiments that we've already run now complete a full factorial in factors B and C. In fact, you can prove this to yourself for the case when factor B is not significant: then it collapses to a full factorial in factors A and C. If factor C is not significant, then it collapses to a full factorial in factors A and B. So from that perspective, these really are a good set of 4 experiments to use.

So now let's imagine that we've run only these 4 experiments. I'd like to show you how we could analyze the data, and I'm going to use the water treatment example again. I hope you don't mind if I rename the factors to A, B, and C. I'm doing this because I want to use the water treatment example that you're comfortable with, but at the end I want to extend what we learn here today to any system, and A, B, and C are the most generic way to do that. Now assume that each of these experiments is very expensive; maybe they cost around $10,000 each. So instead of doing 8, let's assume we've only done these 4: half the work. Our boss is going to be pretty impressed that we've saved $40,000.

Open the software and let's see what happens. Using the best-choice design I talked about earlier, where you've only done experiments 2, 3, 5 and 8 from the original set, I'm going to ask the software to create new variables for A, B and C, which only include those 4 experiments. And here are the 4 outcomes at those conditions. Now if you just go ahead and type in the code from the previous class, you can see that the software will create a model from A, B and C, and it includes two- and three-factor interactions. But what you will notice that's different from last time is all these NA terms. That NA stands for "not available"; those terms cannot be estimated. But we did get estimates of 4 coefficients; we ran 4 experiments, so we expected that. The full prediction model has 8 parameters and would have required 8 experiments to calculate all 8 of them. Let me assume we had done all 8 experiments, and let me compare that to the case where we've only done 4 of the experiments. We're going to write out the two prediction models side by side so that you can see the differences between them.
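If you would like to reproduce that output yourself, here is a minimal sketch, again assuming R as in the previous class. The four outcome values below are placeholders, not the actual water treatment numbers; substitute the values from your own table.

```r
# Half fraction: runs 2, 3, 5 and 8 from the original 2^3 standard-order table
A <- c(+1, -1, -1, +1)
B <- c(-1, +1, -1, +1)
C <- c(-1, -1, +1, +1)
y <- c(52, 46, 60, 75)   # placeholder outcomes; use the real water treatment values

# Ask for the full model, including two- and three-factor interactions
model.half <- lm(y ~ A * B * C)
summary(model.half)
# Only the intercept and the A, B and C main effects can be estimated from 4 runs;
# the AB, AC, BC and ABC coefficients are reported as NA (not available).
```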
In this particular example, you can see that three of the terms are numerically similar; that's not going to lead to serious misinterpretation. However, there is one term that is very different. What has happened there? I'm going to show you now how that reduced design was found. How did we come to that best choice? We call this a half fraction. The full set of experiments for 3 factors would have required 2 to the power of 3, or 8, experiments. If we want to do half the work, then we can divide by 2 here, which is equal to 4. Or, for those of you who remember your exponent rules, we could write this as 2 to the power of (3 minus 1), which equals 2 to the power of 2, which equals 4.

There is a systematic way to select those four runs. Since we know that we will have 4 experiments, we can quite happily go ahead and write out our standard order table for the first two factors, A and B. We do this because we know two factors require 4 experiments. Okay, but what about that third factor, factor C? At what settings should we write out that factor? We write it out as C equals A times B. In fact, we say "generate factor C as A times B". So there we have that factor C is equal to +, -, -, + for the 4 experiments: the multiplication of the values in column A and column B. Let's visualize where those 4 points are on the original cube. The first row is at low A, low B, and high C, so it appears here. The next point is at high A, low B, and low C, so that's over here. The third experiment is there, and the last experiment is at high A, high B, and high C. Notice how that corresponds to the ideal selection of four experiments we made at the start of this video. In the next video I'm going to show you where I got that rule that C should equal A times B.

So let's understand the trade-off here. If we do half the number of experiments, we have to accept that we get less information about the system. I guess you can say there's no such thing as a free lunch; you can't get something for nothing. The question is: what is the penalty for doing fewer experiments? What is this "free lunch" costing me? I mean, if we had paid an extra $40,000 and done the extra four experiments, we'd have that extra information. You can already see that over here. We had good estimates of three of the parameters: the intercept, the A main effect, and the C main effect. But the B main effect was actually quite wrong. Also, you notice that we didn't get any estimates of the two-factor interactions.

Let me drop in two words that we will come back to in later classes: "screening" and "optimization". When we are screening, we don't mind having reduced knowledge of the system. For example, we don't mind if the two-factor interactions are not all known, or if the estimates of the factors are not quite correct. Later on, when optimizing, though, we want more specific information about the system: a better level of prediction accuracy. That is when we will require better resolution of the main effects and interactions. So this is what the $40,000 is costing us: a reduction in the model's prediction quality. You could ask whether that's worth the money saved. Well, you'll never really know the correct answer unless you do the full set of experiments, but I'm going to show you how we can make some educated guesses later in this module. What we've done here, by not running those extra experiments, is rather cleverly select a subset of them to save $40,000. We can use this money later on, when we require a more detailed model to find the optimum in the system.
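Before we wrap up, here is how that generator rule, C equals A times B, looks in code; again a minimal sketch assuming R, with illustrative names only.

```r
# 2^(3-1) half fraction: write A and B in standard order, then generate C = A*B
A <- c(-1, +1, -1, +1)
B <- c(-1, -1, +1, +1)
C <- A * B                       # the generator gives C = +1, -1, -1, +1
half.fraction <- data.frame(A, B, C)
print(half.fraction)
# These are the same four corner points selected at the start of the video:
# (-, -, +), (+, -, -), (-, +, -), (+, +, +)
```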
George Box, the famous statistician from whose textbook we're using this example, said, as a rough rule, that only a portion, about 25%, of the experimental effort and budget should be invested in the first experimental design. I've paraphrased that slightly, but basically he is saying that you should leave some money, and time, for later on to figure out the details. In the beginning you don't even know yet whether A, B or C are actually significant. First figure that out before you go and build a detailed model with two-factor and three-factor interactions.

That's where we're going to leave the class today. We've shown you the end point: when you do half the work, you lose a bit of accuracy in your model, but there's a great built-in backup strategy in the clever selection of which half of the work to do. I guess you could say: at least be smart about which half of the work you do. In the next class we're going to learn the technical terms and the mechanics around creating these half fractions.