I had been planning to read Super Crunchers by Ian Ayres for a while. The theme of the book is how you can draw information out of raw data by applying statistical techniques such as regression analysis and correlation algorithms. The author calls this process “Supercrunching”.
The book starts with an introduction to supercrunching through the example of Orley Ashenfelter, who devised a mathematical formula to determine the quality of wine (and hence its price). Wine connoisseurs initially dismissed the idea that a formula could beat their intuition and years of experience, but the formula has stood the test of time and outperformed the experts in the field by a large margin. Another example is Kasparov’s loss to Deep Blue, but the author fails to mention that Kasparov lost not only because of the huge amount of data that Deep Blue had at its disposal but also because of the team of scientists who were continually refining the algorithms to get to the right moves.
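To make the idea of a wine formula concrete, here is a minimal sketch of a one-variable regression. It is not Ashenfelter’s actual equation: the single predictor (growing-season temperature), the function name fit_line and every number below are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data, NOT from the book: average growing-season
# temperature (degrees C) and a price index for five vintages.
temperature = [15.0, 16.0, 17.0, 18.0, 19.0]
price_index = [10.0, 14.0, 15.0, 21.0, 25.0]

slope, intercept = fit_line(temperature, price_index)

# Predict the price index of a new vintage from its temperature alone.
predicted = intercept + slope * 17.5
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

A real model of this kind would use several predictors and far more vintages, but the principle is the same: fit the coefficients on past data, then plug in the numbers for a new year.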
The first chapter, ‘Who’s doing the thinking for you?’, dwells on recommendation engines such as those of Netflix, Amazon, Last.fm and Pandora. (See the IEEE Spectrum article on the winning algorithm by the current leaders of the Netflix Prize.) Recommendation engines and collaborative filtering have become almost the norm on the web today, with almost every news site (most emailed, most read, most shared), shopping site (people who bought this also bought …), media site (recommendations for artists, songs, albums) and social networking site (such as LinkedIn and Facebook’s ‘People you may know’ feature) having such features built in. A slightly different example in the book is that of Walmart, which uses answers to certain questions to filter out (non-conformist) candidates for certain job profiles, based on the data it has gathered from current employees. Internet search giant Google recently began crunching data from employee reviews and promotion and pay histories in a mathematical formula which it says can identify which of its 20,000 employees are most likely to quit. Fraud detection is another area which this chapter skims, but you could write a whole book on data mining and fraud detection alone.
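The ‘people who bought this also bought’ feature can be sketched in a few lines. This is only the simplest co-occurrence form of collaborative filtering, not what Netflix or Amazon actually run; the function name also_bought and the purchase histories are hypothetical.

```python
from collections import Counter


def also_bought(baskets, item, top_n=3):
    """Rank other items by how often they share a basket with `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(other for other in basket if other != item)
    return [name for name, _ in counts.most_common(top_n)]


# Hypothetical purchase histories, one set of items per customer.
baskets = [
    {"wine guide", "corkscrew", "decanter"},
    {"wine guide", "corkscrew"},
    {"wine guide", "chess set"},
    {"chess set", "chess clock"},
]

recommendations = also_bought(baskets, "wine guide", top_n=2)
print(recommendations)  # "corkscrew" comes first; it shares two baskets
```

Production systems weight these counts (for example by how popular each item is overall), but counting co-purchases is the core of the idea.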
The second chapter, ‘Creating your own Data with a Flip of a Coin’, deals with testing hypotheses by running randomised trials to check the efficacy of a prediction, especially in cases where the cost of running the trial is low. Lots of companies are known to use this. For example, when Google or Yahoo change their front page or layout, different users see different versions of the page (which often differ from each other in a single aspect), and user behavior is then measured and ranked against a whole bunch of criteria. This goes on for several iterations until the design is perfected. This technique is called A/B testing. The efficacy of trial runs can be improved by using the Taguchi methods. I found the author’s approach of using randomised trials to find answers similar to Nassim Taleb’s advocacy of ‘Monte Carlo methods’ for random sampling in ‘Fooled by Randomness’.
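As a rough sketch of how an A/B test is read, the snippet below assigns visitors to a version by coin flip and compares click rates with a two-proportion z-test. The test is a standard choice, not necessarily what the book or these companies use, and the function names and visitor counts are made up.

```python
import random
from math import erf, sqrt


def assign_variant(rng):
    """Flip a fair coin to decide which page version a visitor sees."""
    return "A" if rng.random() < 0.5 else "B"


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF written with the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)


rng = random.Random(42)  # fixed seed so the demo is repeatable
first_visitors = [assign_variant(rng) for _ in range(5)]

# Hypothetical results: clicks out of 10,000 views for each version.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be chance alone, which is the cue to keep the better version and test the next change.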
The chapter ‘How should physicians treat evidence-based medicine?’ shows how evidence-based medicine is profoundly changing medical practice. The most interesting story in the chapter is that of Ignaz Semmelweis, who used statistical techniques to figure out that the transmission of puerperal fever (a form of septicaemia) could be prevented by having doctors wash their hands in a chlorinated solution. Unfortunately, he was ridiculed by many and admitted to an asylum, where, in an ironic twist of fate, he died of septicaemia only a fortnight later. All this happened before Louis Pasteur developed the germ theory of disease. The chapter also relates stories of how rule-based systems such as Isabel are helping to avoid misdiagnosis and assisting doctors in treating patients faster.
The chapter ‘Experts Versus Equations’ expands on the underlying and subtle theme of the book: that supercrunching is slowly but surely defeating the experts in various fields (given accurate and adequate data) and that experts’ intuition is not always right. It goes on to explain how biases and emotions cloud our judgement and reduce the accuracy of our predictions. (A good aside at this point is to read up on confirmation bias, our tendency to search for data that confirms our perception, and cognitive dissonance, the uncomfortable feeling caused by holding two contradictory ideas simultaneously.) The example given is how a very simplistic algorithm outperformed legal experts in predicting how the US Supreme Court justices would vote.
In the chapter ‘Why Now?’, the author makes a compelling case for why supercrunching is becoming more and more relevant and accessible to ordinary people, thanks to the rise in data crunching ability (Moore’s Law), the falling cost of storage for large datasets (Kryder’s Law) and the ease of distribution and verification (the Internet). The author introduces neural networks in this chapter and relates the story of Epagogix, a firm that uses neural networks to improve the commercial gross of a film by tweaking movie scripts. Yes, unbelievable, but who knew? It gives a whole new meaning to formulaic films.
In the chapter ‘Are we having fun yet?’ the author touches upon the topic of education and how Direct Instruction is changing education in America. This was probably one of the most counterintuitive examples in the book. Direct Instruction replaces the discretion of the teacher with a behavioral script for teaching students. Teachers have resisted this method despite overwhelming evidence that it is the best way of teaching students. Again the conflict between intuition and hard numbers crops up in this chapter. Another theme in this chapter is how megacorps are using the data generated by us (such as our buying decisions and browsing behavior) to make decisions that increase their profitability. A recent NYTimes article touches upon this topic as well. (What Does Your Credit-Card Company Know About You?)
The final chapter, ‘The Future of Intuition (and Expertise)’, explains some basic statistics in layman’s terms, such as the 2SD rule, Bayes’ theorem and the margin of error in statistical surveys. I wish this section were more extensive though.
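For readers who want a little more than the chapter gives, here is a minimal sketch of two of those ideas: the 2SD rule applied to a survey’s margin of error, and Bayes’ theorem applied to a positive test result. The function names and the input numbers are my own illustrative assumptions, not examples from the book.

```python
from math import sqrt


def margin_of_error(proportion, sample_size, num_sd=2.0):
    """2SD rule: about 95% of estimates fall within two standard
    deviations of the true value, so report +/- two standard errors."""
    std_err = sqrt(proportion * (1 - proportion) / sample_size)
    return num_sd * std_err


def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) from Bayes' theorem."""
    evidence = (true_positive_rate * prior
                + false_positive_rate * (1 - prior))
    return true_positive_rate * prior / evidence


# Hypothetical survey: 50% support among 1,000 respondents.
moe = margin_of_error(0.5, 1000)

# Hypothetical test: 1% of people have the condition, the test catches
# 90% of true cases and wrongly flags 10% of healthy people.
posterior = bayes_posterior(prior=0.01,
                            true_positive_rate=0.9,
                            false_positive_rate=0.1)

print(f"margin of error: +/-{moe:.1%}")
print(f"chance of having the condition after a positive test: {posterior:.1%}")
```

The second result is the kind of thing intuition gets wrong: with these assumed rates, a positive test still leaves the chance of actually having the condition under one in ten, because healthy people vastly outnumber the sick.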
Overall a good read (3/5), but I wish it had more examples and were more comprehensive. Ian Ayres also writes on the Freakonomics blog. Also, I recently read that IBM is working on software to analyze trends based on real-time data.
More @ Google Talk by Ian Ayres.