A few weeks ago, Andrew Gelman gushed over a report published last fall. The authors, six Microsoft researchers, spend their work days experimenting on people who use Microsoft web products like Bing. If the team wants to find the best layout for a real estate search page, for example, an experiment might direct different users to different webpages and record how many of them clicked on ads. The report describes five puzzling results that came out of experiments like these, and then explains what the Microsoft researchers discovered was going on.
A fun example from the paper is about the peril of reading too much into short-term results. The length of time a user spends on a site and the number of ads she clicks are common metrics used to evaluate changes in website design. A poor design will increase the time confused users spend clicking around, and some of those clicks will land on ads. In the short term, the poor design looks good, but, of course, the user will eventually switch to another service. A result like this apparently fooled Amazon for six months into thinking pop-up ads were a good idea. Shudder.
The paper also supports my idea that every field has its own specific jargon. In web development, what an economist would call a randomized controlled trial is called an A/B test. The control group is called A, and the treatment group is B. The Microsoft researchers say that the most useful tool for diagnosing their problems has been the A/A test, which apparently means you randomly split users into two groups, give both groups the same experience, and compare them as if you were running a real experiment. If your experimental setup is working, the null hypothesis that the two groups are the same will be rejected about 5% of the time (at the 5% significance level). The A/A test reminds me of what an economist might call a Monte Carlo simulation: simulate data from a statistical model, and then re-estimate the model on the simulated data.
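The A/A idea is simple enough to sketch in a few lines of code. Here is my own toy illustration (not the paper's code, and the metric and group sizes are made up): repeatedly split users who all received the same experience into two random halves, run an ordinary two-sample test, and count how often the null is rejected. In a healthy setup the rejection rate should hover near the significance level.

```python
import math
import random

def two_sample_p(a, b):
    """Welch's t statistic with a two-sided p-value from the normal
    approximation (reasonable for large samples)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    # two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

def aa_rejection_rate(n_users=1000, n_trials=2000, alpha=0.05, seed=0):
    """Run many A/A tests and return the fraction of false rejections."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Every user gets the SAME experience; the metric here is a
        # hypothetical per-user outcome (e.g. clicks), drawn from one
        # distribution for everyone.
        metric = [rng.gauss(10, 2) for _ in range(n_users)]
        half = n_users // 2
        if two_sample_p(metric[:half], metric[half:]) < alpha:
            rejections += 1
    return rejections / n_trials

rate = aa_rejection_rate()
```

If `rate` comes out far above 0.05, something upstream of the statistics is broken: the assignment isn't random, users leak between groups, or the variance estimate is wrong.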
The report was fun to read, but it left me a little puzzled. Consider this paragraph from the introduction:
Deploying and mining online controlled experiments at large scale–thousands of experiments–at Microsoft has taught us many lessons. Most experiments are simple, but several caused us to step back and evaluate fundamental assumptions. Each of the examples entailed weeks to months of analysis, and the insights are surprising.
The authors spent months puzzling over the issues they describe in this paper. A few pages later, they mention that similar experimental research teams work at Google and Yahoo. Andrew Gelman wrote that it took him years to grasp the issues described in the paper. Why are Microsoft researchers publishing solutions to the same kinds of puzzles that the researchers at Google and Yahoo are facing? Shouldn't these insights be trade secrets?
My best guess is that there is some sort of principal-agent dynamic going on. The well-written report shows that the authors know what they are doing, and it raises their status in the web experimentation community.