Welcome back. As we've been talking about ways of gathering information about your user interface from users, we've spent a lot of time on user testing and methods that engage directly with one user at a time. In this interview with Ronny Kohavi from Microsoft, we want to talk a little bit more about massive-scale testing, and particularly the industry practice of massive A/B testing. Ronny is a distinguished engineer at Microsoft and general manager of analysis and experimentation. He also has a history in the field as somebody who has brought these concepts into practice at numerous companies, around the idea that we're going to try different things on different people and do that at a large scale. And so, Ronny, welcome, we're delighted to have you here. >> Thank you, delighted to join. >> So you're a well-known expert in, and really an advocate for, massive online A/B testing. For those who may not be familiar with the concept, could you briefly explain to our learners what that is? >> Sure, so the concept is pretty simple. You've got users who are visiting, say, your website, though this could be a client application, or a service, or anything else. You randomly split the population into a control and one or more treatments. We normally use the A/B convention, which implies A is the control and B is a single treatment, although this generalizes to what's called A/B/n or, more generally, controlled experiments. The users interact with your system of interest, and we instrument their activities, whether it's clicks, whether it's hovers, whether it's events that happen at certain times. These instrumented events flow into an analysis system that looks at, typically, averages, but could look at other metrics of behavior. And then statistical tests determine whether the differences between the control and treatment are statistically significant. The whole point of a controlled experiment is that everything external to the system is the same: if it's a good day on the stock market or a bad day, or there's a holiday, anything else that's happening is exogenous. So the fact that I've randomized people into two groups allows me to say, with high confidence, that if there is a statistically significant delta, the only change I've introduced between the A and the B, the feature of interest, is what caused the metrics to be different. So this is key here: we are using the scientific method to show that there is a cause-and-effect relationship between our change and our metrics. >> If you rule everything else out, then what's left must be the truth. So how does this technique work for user interface design and usability? Could you give an example or two where this kind of testing might improve a user interface? >> Yes, so I've been doing controlled experiments, online experiments especially, for many, many years. A little bit at a small company called Blue Martini that I started, where we did this in email: we sent some emails, and from a user perspective the question is, what do you show in the email? Is it a different headline? Are images good? Testing a lot of these questions can help. Then at Amazon, we used a lot of experimentation to drive both the UI, say how the homepage and the product details pages look, and of course the back-end algorithms, like recommendation algorithms. At Microsoft we use experimentation across multiple groups, but most obviously at Bing, where on a given day you will see something like 200 to 300 different treatments running.
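To make the mechanics concrete, here is a minimal sketch of the control-versus-treatment comparison described above. It is illustrative only, not Microsoft's system: the hash-based 50/50 split, the per-user clicks metric, and Welch's t-test are assumptions chosen for the example.

```python
import hashlib
from scipy import stats

def assign_variant(user_id: str, experiment: str = "demo-experiment") -> str:
    """Deterministically split users 50/50 into control (A) and treatment (B)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def analyze(control: list[float], treatment: list[float], alpha: float = 0.05) -> dict:
    """Compare per-user metric averages between the two groups with Welch's t-test."""
    delta = sum(treatment) / len(treatment) - sum(control) / len(control)
    result = stats.ttest_ind(treatment, control, equal_var=False)
    return {"delta": delta, "p_value": result.pvalue, "significant": result.pvalue < alpha}

# Hypothetical per-user clicks collected by the instrumentation pipeline.
control_clicks = [0, 1, 0, 2, 1, 0, 1, 3, 0, 1]
treatment_clicks = [1, 2, 0, 2, 1, 1, 2, 3, 1, 1]
print(analyze(control_clicks, treatment_clicks))
```

In practice the analysis runs over far more users and many metrics at once, and the same idea generalizes to the A/B/n case mentioned above.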
And these are all concurrent, so lots of experiments are running. I'll give a few examples just to show things that have evolved over time. A somewhat old example that I really like to share is an idea that somebody had about the MSN homepage. There's a link to, at the time it was Hotmail, now it's called Outlook.com, which is the web email that Microsoft provides, and somebody said, well, if they're on the MSN homepage and they click on that, instead of just opening it in place, why don't we open it in a new tab? Very simple idea, a few characters of HTML to implement. It was heavily debated: is this the right thing to do? But we ran the experiment, and the results were just unbelievably good, in the sense that all our metrics about people using MSN increased, people coming to Hotmail or Outlook through MSN increased, a lot of the metrics really did improve. And that started a whole series of experiments around when we should or should not open links in a different tab or a different window, depending on your browser. So that was a good example of something that actually started outside the US. They ran this experiment in the UK and the results were very positive. Then we replicated it in the US and the results were again very positive, and then we increased the areas where we were using this technique. You gotta be careful: you don't want to annoy the user by opening too many tabs. But our current thinking is that if users go into a feature, and will then close out of it and want to go back to where they were, it makes sense. And the MSN homepage is the default homepage for a lot of people, so it makes sense that when they're done with their email, they close that tab and go back to the MSN homepage on their prior tab. That makes a lot of sense, very useful, so we shipped it. So that was certainly an example of something very controversial where the experiment had very, very clear results. >> Right, I love that kind of example, because you can picture a group of people sitting around debating both sides of it. Well, you're disorienting them by putting them into a tab. And then you come in with, well, actually, we have the data and we know, which cuts off a lot of that impassioned debate. >> The second example has to do with fonts and colors. This is one of those things that people have very strong opinions about. There are company standards concerning what fonts and colors to use, and usually it is determined by designers who are working with what they believe will be good. But we have the opportunity to actually test things. An example of something we did is we asked, what should we emphasize on the page by making it higher contrast or lower contrast? And it turns out that on Bing, for example, when you look at the search results page, if you lower the contrast on the snippet that's shown below the results, it helps people scan the page faster. When we did some experiments and tested a whole slew of shades and colors, we were able to make a dramatic impact on some of our key metrics, like the time it takes for users to be successful, meaning clicking on some place from which they don't hit back quickly. People were using our system more. So a good example of something, again, simple: test a whole bunch of colors and see which one works best according to your metrics.
You gotta be careful; there's a sort of famous story about a designer at Google who was frustrated because he was forced to test 30 shades of blue. But we believe this has been very useful, and we've been able to tune the page and actually see key metrics improve by changing things like color. I'll share a third example, which is again driven by design. There was a very big push a couple of years ago to introduce a three-column layout. Normal search results on Bing and Google have the main column, which is typically shown on the left, capturing two-thirds of the screen or something like that, and then a second column on the right where we show ads or other things, like related searches. And so there was a proposal to do a three-column layout and show some social networking activity in the third column, which was considered to be this canvas for starting to experiment with what we could show in a third column. Lots of people were excited about this, lots of ideas happened, and we just tried it, and tried it, and tried it, and nothing worked. Meaning all our metrics told us users just don't like it. And you know, you ask yourself, is this the primacy effect, that people are used to the old way and it will take them a long time to adjust? So we ran some experiments for a very long time. When you go to Bing today, you will see that we went back to two columns, because none of the ideas that we tried for months actually worked. So the thing I like about experimentation is, here are our users telling us what they like and don't like through their actual behavior. You can have a lot of opinions in the room; I'll actually show the picture, we use this little hippo icon to share with people what's called the highest paid person's opinion. >> [LAUGH] >> So that's the acronym, HiPPO. And typically you go around the room, and somebody very senior will say, I think this has gotta be the case, and we're going to go with this project. Instead of relying on the HiPPO's opinion, we want to test a lot of different ideas and then make a data-driven decision. That works both in the UX and in the back-end algorithms, and it works really, really well. I'll add one other, final example, something that I think is related but very hard to test with some of the alternatives to online experiments, so you can think about doing an experiment in a lab, or using sketches, or using other things. The one thing that people don't often appreciate is that the implementation matters a lot. And one of the things that's really, really surprising is how much performance matters. You could come up with a new design, you could come up with a new feature, but if it slows the page even by a little bit, you're going to see metrics degrade. One of the best experiments that we run, and we run this fairly regularly, is what's called a slowdown experiment. We slow down the page by something like 100 milliseconds or 250 milliseconds, so a quarter of a second slowdown. It turns out that if you slow Bing by a quarter of a second, which many people think, eh, is not going to make a lot of difference, we lose on a lot of our key metrics, including one which is easy to quantify, which is revenue. We are going to make a lot less money. The way we phrased it internally at Bing is: if you're able to improve server performance by ten milliseconds, that's about one thirtieth of the time it takes your eye to blink, you've just paid for your fully loaded annual cost. This is how much performance matters.
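A slowdown experiment like the one described here can be as simple as injecting an artificial server-side delay for users assigned to the treatment, while logging all downstream metrics exactly as before. The sketch below is a hypothetical request handler under those assumptions, not Bing's implementation; variant assignment is assumed to come from the experimentation system.

```python
import time
from typing import Callable

# Artificial delay per variant, in milliseconds (the 100ms and 250ms values mentioned above).
SLOWDOWN_MS = {"control": 0, "slow_100": 100, "slow_250": 250}

def handle_request(variant: str, render_page: Callable[[], str]) -> str:
    """Hypothetical handler: delay the response for slowdown-treatment users, then render as usual.
    Because nothing else changes, any degradation in clicks, sessions, or revenue
    can be attributed to the added latency alone."""
    time.sleep(SLOWDOWN_MS.get(variant, 0) / 1000.0)
    return render_page()
```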
And that's one of the things that I really like, because when people introduce features, they may slow down the page, and now we give them a trade-off equation: it's going to cost you if you slow down the page. Your feature maybe needs to be this good to compensate for the fact that many of our metrics degrade when we introduce a slowdown to the page. Okay, so hopefully that's a series of examples, some of which people may find very interesting. >> That's fantastic. So with your experience at Bing and MSN, how do you see A/B testing relating to the other forms of traditional usability evaluation? Are you still using walkthroughs, checklists, lab usability tests? Is it another layer, or does this end up replacing some of those other methods in the way that you work? >> So I think it's a great question and one that we've actually discussed. There's a paper, if people are interested in getting more of the details, published in KDD, called Online Controlled Experiments at Large Scale. We have a section where we discuss this notion of an ideas funnel: we have, in general, more ideas than we can implement, right? In A/B testing, one of the disadvantages, I would say, is that you have to build the code. Someone actually has to build the thing that we can deploy to real users, so it has to meet at least a minimum bar in order to run out there. So what we do is we have ideas, lots of ideas being thrown out there. We go through some evaluation using the mechanisms that you present, whether it's reviews by people with experience, whether it's checklists, or usability labs; we may show people sketches. All these things are used to reduce the funnel of ideas into something more manageable. Then we get to the point where we say, okay, these ideas we think are good, we're going to implement them, or we're going to implement three variants of each idea, so that we can actually see how well each of them does in an A/B/C/D test setting, and then we deploy. But there are a few very, very interesting observations to make when you think about the model of the funnel narrowing down, and the fact that the implementation is really the bottom line. The first one is: if something is easy to implement, skip all the prior checks. Right, this example of opening Hotmail, or Outlook.com, in a new window, it's only a one-line change. Don't go through trying to show this to people, don't go through lab reviews, just ship it out there and see what the A/B test tells you, because you're going to get much more accurate results. The real question that we have for these early stages of the funnel is, what is their fidelity? How predictive are they of the end result that the A/B test gives us? We consider the A/B test to be the truth: this is real users using our product, and this is perfect information, or very close to perfect information. So the question is, for the stages before, where we filter ideas, what is their fidelity? And we know that the fidelity is not very good. How do we know that? Because, first of all, most ideas that we try fail. In general, we have this sort of overall statistic at Microsoft: of the ideas that actually get to the point where you implement them and ship them in an A/B test, about a third are going to be good. About a third are going to be a no-op, meaning, eh, you thought it was going to work, but nothing is significant, or very few things are, or they're just borderline.
Then it's probably not worth shipping, especially if it's new code, which means introducing new bugs and maintenance. And then the surprising thing is that about a third are actually bad, meaning here's an idea that you thought was good, and when we shipped it out to users it actually hurt the metrics. That's a very humbling situation. And I'll say that in an area like Bing, which has been running A/B tests for a long time, it's actually worse than one third: about 80% of our ideas are either flat or negative, so it's very, very hard to find really, really good ideas. And so when we go and think about this funnel, we always have to say, okay, how confident are we when we're pruning ideas early on? And we have wonderful examples where some idea was rated low and nobody rated it high, and then at some point some guy wrote the feature over a weekend, and it was worth $50 million. >> [LAUGH]. >> So those are the things that I think are very humbling, the things that cause us to think that there's really no replacement for trying things out there. We can do some pruning, but if we're not sure, if something is really novel and we don't have experience with it, we sometimes have to take the plunge and build what we call in the industry an MVP, a minimum viable product. Some version that has the essence of the idea, maybe launched to a small population that you don't have to worry about. You're in the UX space, so maybe you build it just for one browser; don't worry about compatibility with the other browsers. If it doesn't work on the main browser that your users use on the site, chances are it's not going to work, and you don't have to worry about generalizing it to the rest of the browsers and worrying about JavaScript errors or other compatibility issues. >> Great. So you really are focusing on, yeah, there are other techniques when you're at the early prototype stage, at low fidelity, but there's a limit to how far they can go before you need to really figure out the real experience. So assuming we've got a bunch of people watching this who now want to think about doing these A/B field tests: do you have any useful shortcuts, pitfalls they should avoid, or pointers to references that will help them get going? >> Yeah, great question. This is actually one of the things that we have focused on over the years: identifying the areas where we made mistakes and learning to correct for them or catch them early on. So if you think about it, R. A. Fisher, in the 1920s, wrote the classical book and papers on how to run controlled experiments, so the theory is solid. The practice, getting from the simple t-test to something that you actually run in production, has a lot of pitfalls in it. You can go to exp-platform.com, a website where we publish a few papers; a couple of them are worth mentioning in relation to this question. We wrote a paper called Seven Pitfalls, where we discuss exactly those questions; I'll mention a few of them. And there's another fun paper we called Five Puzzling Outcomes, where something ran and was so surprising to us that we decided to go deep and understand why the results were so strange, and we have the answers for those. It's kind of a challenge: here's a puzzling outcome, we may spend weeks or months trying to drill into what was going on, and then we share it with the readers. There are a few concepts that I think are extremely useful to share when people run experiments.
The first one is triggering. If you're analyzing a population, it is very important that you restrict the analysis to users that could have potentially seen a difference. Let's take the example I gave before, where the Outlook link either opens in place or opens in a new tab. Only about 5, 7, maybe 10% of people that go to MSN actually click on the Outlook link. If I were to analyze 100% of people, 90% or 93% of them would be noise, meaning they were in the control or they were in the treatment but really didn't see the feature that I wanted them to see. It's only that, call it, 7% of people who click on the Outlook or Hotmail link that actually saw a difference. So the point is, make sure to analyze the users that are triggering. And by triggering, we mean they were either in the control but would have seen something different if they were in the treatment, or in the treatment and actually saw something different. There's a certain point where somebody clicks on the link, and then the behavior changes; only those users that click on the link should be analyzed. And if you do that, you dramatically increase your power, the statistical power to determine whether there was a difference between the control and the treatment. So, a very important thing. A second point, a second pitfall, has to do with choosing your metrics right. It is not as obvious as people think to decide what the criterion to optimize is. I'll take Bing as an example. If you look at the highest-level measures for the people who run Bing, they're being measured on things like query share, meaning what share of US queries Bing is getting; it's around, let's say, 21% of queries at the moment. And then the other metric that they look at is revenue: how much revenue are they generating? But if you picked those metrics to run A/B tests, you would be doing a tremendous disservice to your users. One example that we shared in the puzzling results is, if you degrade the algorithmic results, that is, show worse results, you will get more queries. In fact, we showed an example where we had a bug that ranked things really badly, and users increased their queries by 30%. Why? Because they type something and they get poor results, so they reformulate and add a word and get poor results again, and they try this a few times until they find what they need, but it causes them to issue more queries. Likewise, if you show algorithmic results with a bug, but the ads continue to be at the same level they were, then relatively speaking the ads are better, so people start clicking more on the ads and we make more money. If all you did was measure queries and revenue, you should fire the relevance team, which would degrade the results, and that's probably going to work for a couple of months before all of your users have abandoned you. So that's a very, very important lesson about how to pick the right metrics. And we show in the paper that one of the things that we try to do is grow query share, but the way to grow it is not through people issuing more queries in a session or in a task, but rather by lowering the number of queries per task and having them come to you for more tasks. So sessions per user is one of the metrics that we consider to be our north star, because if people come to you more, then it is a really good sign that what you built works well.
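To illustrate the triggering point above, here is a small, purely illustrative simulation (made-up numbers, not Bing data). It assumes roughly 7% of users click the link and that the treatment only changes behavior for those users; analyzing everyone dilutes the measured delta by roughly the trigger rate, while analyzing only the triggered users recovers the full effect with far more statistical power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000                       # users per group (hypothetical)
clicked = rng.random(n) < 0.07   # ~7% of users actually click the Outlook link

# Per-user metric (e.g., visits); identical in both groups except for triggered treatment users.
control = rng.poisson(2.0, n).astype(float)
treatment = rng.poisson(2.0, n).astype(float)
treatment[clicked] += rng.poisson(0.2, clicked.sum())  # effect only where the feature was seen

def summarize(a: np.ndarray, b: np.ndarray) -> str:
    delta = b.mean() - a.mean()
    p = stats.ttest_ind(b, a, equal_var=False).pvalue
    return f"delta={delta:+.3f}, p={p:.3g}"

print("all users:      ", summarize(control, treatment))                    # diluted effect, low power
print("triggered users:", summarize(control[clicked], treatment[clicked]))  # full effect, high power
```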
I'll mention maybe one more thing that we found highly, highly valuable, and this is the concept of an A/A test. Everybody talks about A/B tests or A/B/C tests, but it makes a lot of sense to also run something called an A/A test, where you split your users into two groups that get exactly the same experience. You let the whole experimentation system run, you instrument everything, and you look at the results. And what you want to see at the end is that 95% of the time, your system tells you there is no statistically significant difference at a p-value of 0.05. And it's amazing how many times we run those A/A tests against new clients or against modifications of the system, and they fail at a much higher rate than 5%. So it's a very, very useful tool: run A/A tests, and make sure the overall infrastructure of your system is actually giving you the false positive rate that you want, which is 5%. >> Wonderful. Well, we will put a link to the site that you mentioned so that people can find those papers and the other references you've got there. I want to thank you; this has been really interesting. It gives us a way of thinking about a scale of testing that, when we're sitting around with hand-drawn prototypes, is not quite accessible to us yet, but matters a huge amount in industry practice. And just to remind people, this has been Ronny Kohavi from Microsoft, and we've been discussing the industry practice of massive A/B testing. We will see you soon.
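For readers who want to try the A/A-test sanity check Ronny describes, here is a minimal, purely illustrative simulation: both "arms" are drawn from the same population, so roughly 5% of runs should come out significant at p < 0.05, and a healthy analysis pipeline should match that nominal rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
runs, false_positives = 1000, 0

for _ in range(runs):
    # Both groups come from the same population, so any "significant" result is a false alarm.
    a = rng.poisson(2.0, 5000)
    b = rng.poisson(2.0, 5000)
    if stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05:
        false_positives += 1

# Should land near the nominal 5%; a much higher rate signals a problem in the pipeline.
print(f"A/A false positive rate: {false_positives / runs:.1%}")
```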