So now we are going to create a managerial segmentation. The first thing we'll do is load the data, as usual, always the same file, purchases.txt. Then, as we've done before, we're going to give the columns names that make sense: customer ID, purchase amount, and date of purchase. We're going to tell R that the date of purchase is actually a date, with the as.Date function, specifying the format of the date contained in the text file. Once that's done, we are going to compute two things. We extract the year of purchase from the date, using the function we've covered already, and in the same way we extract how many days have elapsed between January 1st, 2016 and each and every purchase made in this data set. Now that it's done, we have exactly the same data as before, as you can see with the head function, which shows you the first few lines of the file, as well as the summary function, which shows you a few key statistics such as the mean, minimum, maximum, and so on and so forth. And as you can see here, for the date of purchase we actually get a minimum, a mean, a maximum, and so on, which means that R has correctly identified it as data that can be computed on, averaged, and so on.
So, the next thing we are going to do, again, is load the sqldf library, for SQL on data frames, which lets us run SQL statements on our data frames as if they were database tables. And we're going to extract a few indicators that we'll use in this module's segmentation. The first one, of course, is the customer ID of each individual. Then we compute recency, which can be defined as the number of days between January 1st, 2016 and the most recent purchase, that is, the minimum number of days between any purchase made by that customer and January 1st. We also compute first purchase, which is exactly the same thing except that instead of taking the most recent purchase we take the oldest one, so it translates into taking the maximum number of days between any purchase made by one specific customer and January 1st.
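The date handling described above can be sketched roughly as follows. This is a minimal illustration rather than the lecture's actual script: the toy rows, the column names, and the `%Y-%m-%d` format string are all assumptions standing in for the real purchase file.

```r
# Toy purchase records standing in for the course file (hypothetical values).
data <- data.frame(
  customer_id      = c(1, 1, 2, 3, 3, 3),
  purchase_amount  = c(30, 50, 20, 100, 120, 80),
  date_of_purchase = c("2012-03-15", "2015-11-02", "2011-06-30",
                       "2014-01-10", "2015-05-20", "2015-12-24"),
  stringsAsFactors = FALSE
)

# Tell R the text column is a date; the format string is an assumption.
data$date_of_purchase <- as.Date(data$date_of_purchase, format = "%Y-%m-%d")

# Year of purchase, extracted from the date.
data$year_of_purchase <- as.numeric(format(data$date_of_purchase, "%Y"))

# Days elapsed between each purchase and January 1st, 2016.
data$days_since <- as.numeric(
  difftime(as.Date("2016-01-01"), data$date_of_purchase, units = "days")
)

# First few rows, then key statistics per column.
head(data)
summary(data)
```

Because `days_since` is numeric, `summary` reports a minimum, mean, and maximum for it, which is a quick check that the conversion worked.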
Then we're going to compute frequency, by counting the number of purchases that customer has made, as well as the average purchase amount for that customer. We run the query on the data frame called data, which we've just loaded, and we group by 1, meaning that we group by the first variable in the query, which is customer ID. So we'll have only one row of data per unique customer. And now we have our data, as you can see: 18,000 customers and 5 variables, which you can explore as we've done before, looking at the first few lines of that new data, computing summary statistics, and creating histograms of recency, frequency, and amount with a finer grain, and so on and so forth. As you can see here, we've already covered that.
Now, how can you actually create a managerial segmentation in R? Well, the first thing to remember is that most managerial segmentations are nothing more than if-then-else statements. Let's begin with a very simple statement such as: I'd like to call inactive any customer who hasn't made any purchase for at least three years. That ifelse statement will do the trick. If you look at it, customers_2015 is basically the data set we've extracted, with recency, first purchase, frequency, and amount. And you can automatically create a new variable called segment in which we'll store a value. That value will be either inactive or NA, with NA meaning we haven't assigned any kind of segment yet, depending on the test we do here: if the recency variable in customers_2015 is more than three years, then we qualify that customer as inactive. If not, we simply don't qualify that customer at all. Look at the data before you run that: five variables. You run it, and then you get a sixth variable showing up, which is the value of the segment field based on your ifelse test: either inactive for those with a recency above three years, or NA for those with a recency below three years.
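As a rough sketch of the two steps above, the block below builds one row per customer and then applies the single ifelse test. It uses base R `tapply` in place of the sqldf query from the lecture, so it runs without extra packages. The toy values, the column names, and the use of `365 * 3` days for "three years" are assumptions.

```r
# Toy purchases; days_since is days between each purchase and 2016-01-01
# (hypothetical values).
data <- data.frame(
  customer_id     = c(1, 1, 2, 3, 3, 3),
  purchase_amount = c(30, 50, 20, 100, 120, 80),
  days_since      = c(1400, 60, 1650, 720, 230, 8)
)

# One row per customer: base R stand-in for the GROUP BY query.
customers_2015 <- data.frame(
  customer_id    = sort(unique(data$customer_id)),
  recency        = as.numeric(tapply(data$days_since, data$customer_id, min)),
  first_purchase = as.numeric(tapply(data$days_since, data$customer_id, max)),
  frequency      = as.numeric(tapply(data$days_since, data$customer_id, length)),
  amount         = as.numeric(tapply(data$purchase_amount, data$customer_id, mean))
)

# Single-segment rule: recency above three years means inactive, otherwise NA.
customers_2015$segment <- ifelse(customers_2015$recency > 365 * 3, "inactive", NA)

customers_2015
```

The data frame goes from five columns to six once `segment` is added, which mirrors what the lecture shows on screen.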
And of course inactive is just one segment. You can have many more, meaning that you can put multiple ifelse statements together to create the entire managerial segmentation. Well, that's actually not such a good idea. Let me show you why. One thing you can do is check how many customers are in each segment, with a table statement called on the segment column. In the table, you get 9,000 customers who are inactive and 9,000 customers who are not. And then you can also compute averages with the aggregate function, which we've seen before. The data we're going to average is pretty much everything within the customers_2015 data set, from the second to the fifth variable, meaning recency, first purchase, frequency, and amount. We group them by a list made of the segment they belong to. What kind of computation do we apply? The mean, meaning we're going to compute the mean of all these variables, grouped by segment. If you do that, as you can see here, the inactive segment has an average recency of 2,178 days and an average first purchase of 2,500 days, and its customers have made on average 1.8 purchases for an average amount of 48. So as you can see, the most recent customers, those who have a recency lower than three years, tend to have made more purchases, and for a higher amount on average.
Now, I said before that using ifelse statements for segmentation may not be such a good idea. Let's see why. Basically, if you'd like to go further and have not only an inactive segment but also a cold segment, what you have to do is embed another ifelse statement within the ifelse statement. So if the customer's recency is above three years, then it's a yes, and we call that customer inactive. If it's a no, then it depends, and we test the recency again. Is recency above two years? Yes, it's cold. No? Then the customer doesn't belong to any segment yet. You could do that, of course, and if we run it, it works perfectly. I'm going to run everything in one go.
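A minimal sketch of the nested test, the segment counts, and the per-segment averages described above might look like this. The six toy customers and the day thresholds (`365 * 3`, `365 * 2`) are assumptions made for illustration.

```r
# Toy per-customer table (hypothetical values).
customers_2015 <- data.frame(
  customer_id    = 1:6,
  recency        = c(1500, 1200, 900, 800, 200, 30),
  first_purchase = c(2500, 1200, 2000, 800, 1500, 30),
  frequency      = c(2, 1, 3, 1, 6, 1),
  amount         = c(40, 25, 60, 35, 150, 80)
)

# Nested ifelse: above three years is inactive, else above two years is cold,
# else no segment yet.
customers_2015$segment <- ifelse(
  customers_2015$recency > 365 * 3, "inactive",
  ifelse(customers_2015$recency > 365 * 2, "cold", NA)
)

# Count customers per segment; useNA = "always" also counts unassigned ones.
table(customers_2015$segment, useNA = "always")

# Average profile per segment. Columns 2 to 5 are recency, first_purchase,
# frequency, and amount. Note that aggregate() omits rows whose grouping
# value is NA, so unassigned customers do not appear in this summary.
segment_means <- aggregate(
  x   = customers_2015[, 2:5],
  by  = list(segment = customers_2015$segment),
  FUN = mean
)
segment_means
```

With only two levels of nesting this is still readable; each additional segment adds another nested call, which is where the approach starts to get hard to follow.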
So you have 9,000 inactive customers, as before, you have 1,900 cold customers, and you have 7,300 customers who haven't been qualified yet. And you can see that for recency, first purchase, frequency, and amount, you get a nice picture of which people are in which segments and how they behave. But if you have 10, 15, or 20 segments, that ifelse within ifelse within ifelse can become a complete mess, and it's extremely easy to get it wrong and put an ifelse statement where it's not supposed to be. Imagine, for instance, that you have ten segments: it means you probably have eight or nine ifelse statements, each nested within another. It can become a real mess extremely quickly. So unless you are applying an extremely simple structure, ifelse is usually a really bad idea: it's very hard to read, very hard to maintain, and you are very likely to make tons of mistakes.
So as soon as you get to a slightly more complex segmentation scheme, my suggestion is to use the which statement. How does it work? Well, first of all, I need to reset everything: take the variable called segment, which we have already, and set it to NA everywhere. So everybody begins as NA, a not-applicable segment. The variable now exists and we can work with it. That's the key. Now, for each and every customer, we're going to look at which ones match the condition, which ones have a recency above three years. Those ones, and those ones alone, we'll call inactive, and that's the which here, which is the key. If you do that, you get exactly the same thing that we got before: 9,100 inactive customers and 9,200 customers with no segment yet, with the same averages and so on and so forth. But now it's much easier to move to slightly more complex segmentation solutions, because each segment definition becomes its own line, and it's much easier to follow a fairly complex structure. So here we do exactly the same thing, but we'll work with a few more segments.
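The reset-then-assign pattern with which can be sketched as follows. The toy recency values and the `365 * 3` threshold are assumptions; the pattern itself is just base R indexing.

```r
# Toy per-customer table (hypothetical values).
customers_2015 <- data.frame(
  customer_id = 1:5,
  recency     = c(1500, 1200, 900, 200, 30)
)

# Reset: everyone starts with no segment.
customers_2015$segment <- NA

# which() returns the row numbers where the condition is TRUE;
# only those rows are relabelled.
inactive_rows <- which(customers_2015$recency > 365 * 3)
customers_2015$segment[inactive_rows] <- "inactive"

table(customers_2015$segment, useNA = "always")
```

Each further segment would be one more line of the same shape, instead of one more level of nesting.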
For instance, we're going to say that any customer with a recency above three years is inactive. Any customer with a recency below three years but above two is cold. Any customer with a recency of less than two years but more than one will be classified as warm, and then any customer with a recency of less than a year will be qualified as active. If you do that, it's actually much easier to follow through a complex segmentation. Here, as you can see, all customers have been qualified as either active, cold, inactive, or warm. As you can see, the recency matches perfectly: by definition, the active customers have a recency of less than a year. They have been quite active, with 4.56 purchases over their lifetime and quite a high average purchase amount, and so on and so forth, and you have your four segments here.
The only trick, if you use that which statement to create a segmentation of your own, is to make sure that, first, you do not forget anyone, so that if you apply all these segmentation rules everyone is qualified by at least one segment, and second, that everyone is qualified by only one segment. For instance, here, if I removed that part of the condition, I would make a huge mistake. Here, every customer with a recency above three years would be qualified as inactive. And here, every customer with a recency above two years would be qualified as cold. Well, it so happens that if you match the first criterion you match the second one as well, meaning that none of your customers will be inactive. They will all be cold, and the inactive segment will just disappear. So you need to make sure that your segment definitions do not overlap and that everyone goes through the segmentation process nicely.
So what does our complete managerial segmentation look like? Well, actually, it looks like that. And it's actually not that hard to follow if you remember what the managerial segmentation looks like. We reset everything to NA, and then we qualify customers based on recency alone, here.
Inactive, cold, warm, and active customers. Here is the trick to make the segmentation easier to manage. For instance, if you take all those customers who are active, meaning they have made at least one purchase within the last year, and you use that segment qualifier as a condition within the next few lines, such as here, you can actually make life much easier. So here, I'm going to select all the customers who have their segment set to active for the time being and who made their first purchase within the last year. Well, if they are active and made their first purchase in the last year, they are new active. So I'm requalifying the segment here based on the segment they had before and on the first purchase date. Or here, based on amount: if you are qualified as active and your average purchase is less than $100, then you are qualified as active low value. If it's $100 or above, then it's active high value. And remember, your segments cannot overlap, and they need to cover everybody. So if I did that, for instance, and removed the equals sign, it would mean that customers whose average purchase amount is exactly $100 fall in neither segment, which is bad.
So you have your entire segmentation working here. I'm going to run this whole code together. You have the number of customers in each segment. For instance, for new active we have 1,500 customers here, with a pretty nice description of the average behavior of customers within each segment. For instance, the active high value customers have an average recency of 88 or 89 days, made a first purchase 2,000 days ago, have made on average 5.6 purchases over their lifetime so far, and have an average purchase amount around $240. And that's your entire segmentation here, working pretty well. As you can see, it's much more readable than if you had ifelse statements put together into a long list of intermingled statements. There are a few more things I'd like to do with you before moving to the next stage of this course.
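Putting the pieces together, a sketch of the two-step segmentation described above could look like the block below. It only covers the segments the lecture spells out in this passage. The toy customers, the exact label strings, and the 365-day year are assumptions; the $100 threshold comes from the lecture.

```r
# Toy per-customer table (hypothetical values).
customers_2015 <- data.frame(
  customer_id    = 1:7,
  recency        = c(1500, 900, 500, 200, 100, 50, 300),
  first_purchase = c(2500, 2000, 500, 200, 1800, 1500, 900),
  amount         = c(40, 60, 30, 80, 250, 45, 100)
)
r <- customers_2015$recency  # short alias to keep the rules readable

# Reset everyone to NA.
customers_2015$segment <- NA

# Step 1: recency alone. Every rule carries both of its bounds, so no
# customer can match two rules and no customer is left out.
customers_2015$segment[which(r > 365 * 3)] <- "inactive"
customers_2015$segment[which(r <= 365 * 3 & r > 365 * 2)] <- "cold"
customers_2015$segment[which(r <= 365 * 2 & r > 365 * 1)] <- "warm"
customers_2015$segment[which(r <= 365)] <- "active"

# Step 2: refine "active" by reusing the current segment label as a condition.
customers_2015$segment[which(customers_2015$segment == "active" &
                             customers_2015$first_purchase <= 365)] <- "new active"
customers_2015$segment[which(customers_2015$segment == "active" &
                             customers_2015$amount < 100)] <- "active low value"
# The ">=" matters: with ">" a customer at exactly 100 would match neither rule.
customers_2015$segment[which(customers_2015$segment == "active" &
                             customers_2015$amount >= 100)] <- "active high value"

table(customers_2015$segment, useNA = "always")
```

Because the new active rule runs first, those customers no longer carry the label "active" when the two value rules run, so the order of the lines in step 2 is part of the segment definition.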
And that's here: as you can see, the names of the segments are in alphabetical order. We go from active high to active low, then C, I, N, N, W, W. That's not really relevant from a managerial point of view, so we're going to reorder these segments in an order that makes sense, so they can be read more easily by managers and by yourself. I'm going to take the segment variable here and say it's a factor built from the exact same variable, so I'm not changing any values. I'm just specifying that the levels of that factor are, in that order, inactive, cold, warm, warm low, new warm, active high, and so on and so forth. It doesn't change anything except that the values are now stored internally in a specific order. And once you rerun that, you get the least active segments at the top and the most active segments at the bottom, and as you can see it's much easier to read and to analyze.
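The reordering step can be sketched as below. The segment labels and the particular level order are assumptions chosen for illustration; the point is only that `factor(..., levels = ...)` controls the display order without changing any value.

```r
# Toy segment labels (hypothetical values).
segment <- c("active high value", "cold", "inactive", "new active",
             "warm", "active low value", "cold")

# As plain text, table() lists the segments alphabetically.
table(segment)

# Declare the same values as a factor with an explicit level order,
# from least active to most active.
segment_ordered <- factor(
  segment,
  levels = c("inactive", "cold", "warm", "new active",
             "active low value", "active high value")
)

# Same counts, now listed in the chosen order.
table(segment_ordered)
```

Any label missing from `levels` would silently become NA, so it is worth checking that the level list covers every segment name.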