You've just learned that while more complex models can capture patterns and trends in the data with less error, they can potentially over-fit the training data, leading to even more error when generalizing to new data. You learned the importance of splitting off a subset of your data for final testing, as well as using validation data to avoid choosing a model that over-fits. Now let's see how to do this in MATLAB, using trip duration prediction as an example.

To get started, import the taxi data for the month of May. You'll want to do so without any preprocessing, since you still need to split off the test set. For reproducibility, set the random number generator seed. This way, if you re-run the script in the future, you won't get a different test and training split. To create the partitions, use the function cvpartition. The arguments should be the height of the data, the holdout method, and the fraction of the data to use as the test set, say 20 percent. This will give you a cvpartition object that can be used to get indices for the test and training data. Extract the test indices from this object using the function test. Next, use the result to extract the test data by indexing into the original table. Remember, you'll set this data aside for the time being. Similarly, get the training indices using the training function on this object, and extract the training data by indexing into the original table. Finally, before you train any models, apply the basic preprocessing function to the training data to add the time of day feature and the day of week feature. A code sketch of these scripted steps appears at the end of this section.

Now open the Regression Learner app and start a new session. Select the training data as the data set variable and duration as the response variable. Next, select distance, pickup longitude, pickup latitude, drop-off longitude, drop-off latitude, time of day, and day of week as predictors. Finally, for later comparison with the validation results, start with no validation selected, as you've been doing up until now. Training with the All Trees option produces coarse, medium, and fine trees. Examining the results, the fine tree has the best RMSE. However, these metrics were calculated using the same data that was used to train the models, which means we may be over-fitting. Let's see how this compares to the results when using validation data.

To do this, start a new session. The data set, response, and predictors will all be the same, but this time you'll use validation. Although you could use cross-validation, since the data has over 200,000 rows, holdout validation should be sufficient, and it is faster. So let's try this with, say, 20 percent of the data. Train the same tree models as before. This time, however, the metrics will be calculated using the 20 percent holdout validation data. Now the coarse tree has the best RMSE. As suspected, the fine tree was over-fitting the training data, and the coarse tree generalizes better. A second sketch at the end of this section shows a programmatic version of this comparison.

In this video, you saw how to use validation data in the Regression Learner app to better evaluate model performance and avoid choosing a model that over-fits. Next, you'll learn additional methods and training options to create models with even better performance.
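For reference, here is a minimal sketch of the scripted split-and-preprocess steps described above. The file name taxi_may.csv and the helper preprocessTaxiData are assumptions standing in for the course's actual data file and preprocessing function:

```matlab
% Minimal sketch of the test/training split described above.
% Assumptions: the May taxi data is in "taxi_may.csv", and
% preprocessTaxiData stands in for the course's preprocessing
% helper that adds the time of day and day of week features.

taxi = readtable("taxi_may.csv");               % import without preprocessing

rng(0)                                          % fix the seed for a reproducible split
c = cvpartition(height(taxi), "HoldOut", 0.2);  % hold out 20% as the test set

idxTest  = test(c);                             % logical indices of the test rows
idxTrain = training(c);                         % logical indices of the training rows

testData  = taxi(idxTest,  :);                  % set aside for final testing
trainData = taxi(idxTrain, :);

trainData = preprocessTaxiData(trainData);      % add TimeOfDay and DayOfWeek
```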
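The validation comparison itself happens in the Regression Learner app, but a rough programmatic equivalent could look like the following. The response name Duration is an assumption, the table is assumed to contain only the selected predictors plus the response, and the MinLeafSize values are assumed to mirror the app's fine, medium, and coarse tree presets:

```matlab
% Rough sketch of the holdout-validation comparison done in the app.
% Assumptions: trainData is the preprocessed training table from above,
% containing only the selected predictors plus a Duration response, and
% MinLeafSize values of 4, 12, and 36 correspond to the app's fine,
% medium, and coarse tree presets.

cv = cvpartition(height(trainData), "HoldOut", 0.2);  % shared 20% validation split

leafSizes = [4 12 36];                  % fine, medium, coarse
names = ["fine" "medium" "coarse"];
for k = 1:numel(leafSizes)
    mdl = fitrtree(trainData, "Duration", ...
        "MinLeafSize", leafSizes(k), "CVPartition", cv);
    rmse = sqrt(kfoldLoss(mdl));        % kfoldLoss returns validation MSE
    fprintf("%s tree validation RMSE: %.2f\n", names(k), rmse);
end
```

Using one shared cvpartition object ensures all three trees are scored on the same held-out rows, so their validation RMSE values are directly comparable.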