Hi, I'm Juan Vergara. Welcome back. We're back from putting three brain versions to train. After waiting for our training sessions to finish, we are ready to review the results.

Spoiler alert: when training controls with Bonsai, it is always important not to run too many experiments at once. Also, ensure your experiments build on top of the previous ones piece by piece. If you change too many things at once, it will be hard to understand the individual impact each change had on the final experiment, or even worse, you might jump to the wrong conclusions and drift away from the optimal machine teaching strategy for your specific use case.

Let's do a quick recap of the experiments we have planned for this brain thus far. First, we have an initial version of the brain with the first state and action definition that came to mind, as well as the simplest goal statement we were given. There wasn't a specific hypothesis for our first experiment. The only basic hypothesis is that you expect your brain to be able to learn a meaningful policy. Otherwise, we would have added states or modified actions as needed to ensure, first, that there is enough signal in the states for the brain to take applicable and meaningful actions, and second, that the actions have the expected effect on our simulation environment.

The second version of our brain merged the left and right actions into a single action called engine2. Our hypothesis for this experiment is that simplifying the action space should help the brain learn beyond the initialization weights at iteration zero, something our first brain didn't do.

Last but not least, our third version of the brain focused on defining an episode length that is sensible for this problem. We set the episode iteration limit to roughly double the average crash iteration we observed on v01, which was 79 iterations. Our hypothesis for this experiment is that reducing the episode iteration limit should help the brain identify expert policies more quickly. Additionally, this version built upon the aggregation of the left and right engine thrusts into the single engine2 action. We expect v03 to be our most robust brain trained so far.

Let's start by looking at v01. Click over the first version and go to the Train tab. You can see that the brain training progressed at the very end of our finished training session, which we started after increasing the No Progress Iteration Limit. You might question why your training session stopped despite a new champion being found at the very end. Why didn't the brain training continue after all?

Testing sessions are scheduled throughout brain training. If we reach the No Progress Iteration Limit, and none of the previous testing checkpoints happened to defeat the last champion, training stops. Yet, to avoid losing any improvements made up to the point training stops, Bonsai performs a final evaluation of your brain. If a new champion is found at that point, the brain weights are saved as the new champion. It is then up to the user to decide whether to further increase the No Progress Iteration Limit value, even if by a single iteration, to continue training from that last saved checkpoint.
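To make that stopping rule concrete, here is a toy Python sketch of the logic as just described. Everything in it, from the function names to the way checkpoints are scored, is our own illustration under those stated assumptions, not Bonsai's actual implementation:

```python
import random

def assess(checkpoint: int) -> float:
    """Stand-in for a Bonsai testing session: returns a score for a checkpoint.

    Seeding with the checkpoint id keeps the toy example deterministic.
    """
    random.seed(checkpoint)
    return random.random()

def run_training(no_progress_iteration_limit: int,
                 iterations_per_checkpoint: int = 10_000) -> float:
    champion_score = assess(0)          # iteration-zero (random-policy) champion
    checkpoint = 0
    iterations_since_champion = 0

    while iterations_since_champion < no_progress_iteration_limit:
        checkpoint += 1
        iterations_since_champion += iterations_per_checkpoint
        score = assess(checkpoint)
        if score > champion_score:      # a testing checkpoint beat the champion
            champion_score = score
            iterations_since_champion = 0   # progress resets the no-progress window

    # Final evaluation before stopping, so late improvements are not lost:
    final_score = assess(checkpoint + 1)
    if final_score > champion_score:
        champion_score = final_score        # saved as the new champion
    return champion_score

print(run_training(no_progress_iteration_limit=50_000))
```

The takeaway is that the loop can legitimately end with a brand-new champion found at the very last evaluation, which is exactly what we observed on v01.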
We already looked at both the goal satisfaction graph and the episode iterations graph in the past. Yet, there are other graphs available for analysis. Before we go to those, let's take one last look at the goal satisfaction graph, for the avoid goal specifically.

The value goes from 7.9 percent at the very beginning to 15.2 percent at our last checkpoint, the final champion at 500,000 iterations. Goal satisfaction is computed to display continuous, linear progress to the user, so our avoid performance can be interpreted as follows: our brain was able to avoid crashing 7.9 percent of the time. What time, you might ask? Well, the episode iteration limit, which is 1,000 iterations for this brain version. Remember, average episode length was 79 iterations in the previous video? Well, now you can see how these two values are related: 7.9 percent of 1,000 iterations is exactly 79 iterations, the time it took the brain to crash, as seen in the episode iterations graph.

We didn't have a look at the goal robustness and success graphs earlier. Let's focus for now on the success graph. Click over "Goal satisfaction" on the left, and select "Success". You will quickly notice that success displays lower values than goal satisfaction for our experiment. Our success rate is zero for avoid crashing. Why, you may ask? Well, success is designed to be a binary indicator of success; there are no gray areas. An episode succeeds if and only if all goals are satisfied throughout the episode, from the first iteration to the last. Otherwise, the episode counts as a failure. That's why overall success and avoid-crash success both display values of zero.

You might wonder, then, why the minimize objectives show non-zero values that are quite high. How is that even possible? Well, it seems minimize angle is succeeding in 50 percent of the episodes. What that means is the following: 50 percent of the evaluation episodes succeeded at minimizing the angle within the defined range. Note, we don't double-count episode failures here, whether the ship crashed or not. In this case, we know that all episodes failed at avoiding a crash, yet it seems that for the length of the episode prior to crashing, 50 percent of the episodes succeeded at maintaining the angle within the defined range. What this tells us is that this objective is neither too complex nor too hard to accomplish, even for an initial random policy. But let's not worry about that fact for now.

Goal satisfaction and success are great indicators of brain progress, yet they are not always comparable across brain versions. For example, the following two modifications would affect the final computation even for fully trained brains: modifying the range of any goal affects the computation of goal satisfaction as well as success; equally, adding or removing a goal affects the computation of both too. That is the reason why it is much better to look at KPI metrics that are independent of the goal definition. We have already mentioned the importance of having a single KPI metric for brain evaluation, and we will be underlining that importance in a future week. Before that, though, we will need to extract the logs from Log Analytics, and we want to make that easy for you this first time around. For now, let's just look at our minimize metrics. Click over "Success" and select the minimize mean value. These values are comparable across experiments as long as you don't modify the computation of the value being driven into its desired range. Even if you do change the target ranges to be satisfied, these values will remain comparable.

To easily review the results, we encourage you to follow a simple format in a spreadsheet. In this spreadsheet, you want to track the brain name and version, a brief description of the experiment (just as the notes we have been adding do), as well as the key metrics to evaluate success. In our table, we include goal satisfaction and success, as well as episode length, the individual avoid satisfaction as defined within the goal satisfaction graph, and the mean and final values for the minimize objectives.
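Several of the derived numbers in that table come from simple conversions between the percentages on the graphs and simulator quantities. Here is a small Python sketch of those conversions, using the values quoted in this video; the constant and function names are ours, and the 30-episode default assessment size is the one we rely on again when reading Version 2 below:

```python
# Back-of-envelope conversions between Bonsai graph percentages and
# simulator quantities, using the v01 values quoted above.

EPISODE_ITERATION_LIMIT = 1_000  # v01 episode iteration limit
EPISODES_PER_ASSESSMENT = 30     # default number of evaluation episodes

def avoid_satisfaction_to_iterations(satisfaction: float) -> float:
    """An avoid-goal satisfaction of X% means the ship survived, on
    average, X% of the episode iteration limit before crashing."""
    return satisfaction * EPISODE_ITERATION_LIMIT

def success_rate_to_episodes(success_rate: float) -> float:
    """Success is binary and per-episode, so a success rate maps to a
    whole number of successful evaluation episodes."""
    return success_rate * EPISODES_PER_ASSESSMENT

print(avoid_satisfaction_to_iterations(0.079))  # 79.0 iterations: iteration-zero policy
print(avoid_satisfaction_to_iterations(0.152))  # 152.0 iterations: final v01 champion
print(success_rate_to_episodes(0.1333))         # ~4.0 successful episodes out of 30
```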
If you decide to extract the same values we did, your table should look like the following one. As you can see, we extracted the mean and final values for each of the three minimize goals.

We're good to move to Version 2 now. Clicking over "Version 2" and ensuring you are on the Train tab, we are presented with a graph that looks a bit better than the previous one. In this case, we have two champions that beat the initial random policy. The second champion has a goal satisfaction of 48 percent for the avoid-crash objective. This translates to an average of approximately 480 iterations per episode during our testing evaluation. In case you don't trust me, let's look at the episode iterations graph to double-check this is the case. Click over "Episode iterations" on the left axis, where we can now hover and look at the episode length. Indeed, we see the value is 483 iterations.

Let's have a quick look at the success tab as well. Here we can see that success went up to 13.33 percent, which means more than one episode fully succeeded at not crashing for the full episode length of 1,000 iterations. Actually, since each evaluation test consists of 30 episodes by default, we can quickly identify that it had to be four successful episodes against 26 failed ones.

Time to extract the same metrics we read for our first version: extract goal satisfaction, success, and the mean and final values for the minimize objectives. If you are looking at the same values we are, your table should now look like the following. Note, these metrics are indirectly affected by the episode length. Our Version 2 experiment has an average value of 0.09 for minimize angle, whereas our v01 experiment has a value of 0.16. Yet looking at the final values, we see both experiments end with a similar final angle, around 0.12 to 0.13. This must mean that the second brain version is learning to oscillate around an average distance of 0.09 to the vertical position, with an amplitude reaching at least as high as 0.13, the final value found for the angle after a mean of 483 iterations per episode.

Let's jump into the last version. At last, we have one brain version that succeeds at all goals. We reach 100 percent goal satisfaction at 346,000 iterations into the training session. Note, when you reach 100 percent goal satisfaction, you are implicitly also reaching 100 percent success. There's no way you can succeed at 100 percent of the testing iterations (the continuous progress displayed on the goal satisfaction graph) while not hitting 100 percent of the binary episode success criteria (the per-episode binary indicator that the success graph presents).

You might be wondering why the brain continues finding champions after reaching 100 percent goal satisfaction. Well, it is for that reason that we have the goal robustness graph. Click over the "Goal satisfaction" button and select "Goal robustness". You can see in this graph the slight improvements that keep happening between one champion and the next, despite the already reached value of 100 percent satisfaction. They all keep succeeding 100 percent of the time, yet some of them are more robust and are able to minimize even deeper within the desired target ranges. If you click on the left axis button and go now to the minimize mean value, you should also see the actual progress made by those consecutive 100-percent-goal-satisfaction champions, this time in the units of the simulation.

Time to transfer these values to the spreadsheet and add a coloring theme so that we can clearly identify which version provides the most consistent results. Overall, Version 3 of the current brain provides the best control as per the minimize metrics extracted, as well as more gradual learning throughout the training session, as evidenced by the abundance of checkpoints.

On this last note, we want to be mindful of a further practicality of having a shorter episode iteration limit: the shorter the episode length, the more checkpoints you will have for evaluating the policy. After all, Bonsai dedicates five percent of the resources to testing and the rest to training. The longer the episode length, the longer it will take for your testing sessions to complete. The more you can narrow down the episode length, the more automatic assessment checkpoints you will be able to have, and the more likely it will be for you to find a champion throughout your training session. Yet, ensure that you're defining an episode length that's long enough for your brain to have a chance at succeeding at your task, as well as being exposed to the full dynamics of your problem, including any possible delays between an action being taken and the states being affected by that action.
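To see why shorter episodes yield more checkpoints, here is a rough back-of-envelope sketch in Python. The five-percent testing share and the 30-episode assessment size are the figures mentioned in this course; the rest of the arithmetic is our own simplification:

```python
# Rough estimate of how many automatic assessments fit into a training
# session, assuming 5% of iterations go to testing and each assessment
# runs 30 episodes of up to episode_iteration_limit iterations each.

TOTAL_TRAINING_ITERATIONS = 500_000
TESTING_SHARE = 0.05             # share of resources Bonsai dedicates to testing
EPISODES_PER_ASSESSMENT = 30

def estimated_assessments(episode_iteration_limit: int) -> float:
    testing_budget = TOTAL_TRAINING_ITERATIONS * TESTING_SHARE
    iterations_per_assessment = EPISODES_PER_ASSESSMENT * episode_iteration_limit
    return testing_budget / iterations_per_assessment

print(estimated_assessments(1_000))  # v01/v02 limit of 1,000 iterations
print(estimated_assessments(158))    # double the 79-iteration average crash time
```

In reality, episodes that crash early finish sooner, so this overstates the cost of each assessment, but the direction of the trade-off holds: a tighter episode iteration limit buys you noticeably more frequent checkpoints.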
It is time for you to investigate whether your results are the same as ours. Do you arrive at the same conclusions? Is there anything you would interpret differently? Note, initial weights are not always exactly the same; thus, it is not uncommon for the exact same Inkling files to produce slightly different trained brains. Be mindful of that when comparing results with peers or even the customer, whether internal or external to your organization.