Archive for May, 2009

Supercrunchers – How Data Analysis is Changing our Lives


I had been planning to read Supercrunchers by Ian Ayres for a while. The theme of the book is how you can extract information from raw data by applying statistical techniques such as regression analysis and correlation algorithms. The author calls this process “Supercrunching”.

The book starts with an introduction to supercrunching through the example of Orley Ashenfelter, who devised a mathematical formula to determine the quality of wine (and hence its price). Wine connoisseurs initially dismissed the idea that a formula could beat their intuition and years of experience, but the formula has stood the test of time and outperformed the experts in the field by a large margin. Another example is Kasparov’s loss to Deep Blue, though the author fails to mention that Kasparov lost not only because of the huge amount of data Deep Blue had at its disposal but also because of the team of scientists who were continually refining the algorithms to find the right moves.
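
As an aside, the kind of model Ashenfelter built is essentially an ordinary least-squares regression over weather variables. Here is a minimal sketch in Python of fitting and using such a formula; the vintages, weather figures and quality scores below are invented for illustration, not Ashenfelter’s actual data.

```python
# A minimal sketch of the kind of regression Ashenfelter ran: predicting
# wine quality from weather data via ordinary least squares. All numbers
# below are invented; his real model used winter rain, harvest rain,
# growing-season temperature and the age of the vintage.
import numpy as np

# Hypothetical rows: [winter_rain_mm, harvest_rain_mm, avg_temp_c, age_years]
X = np.array([
    [600., 120., 17.1, 15.],
    [690.,  80., 16.7, 12.],
    [502., 130., 17.2, 10.],
    [420., 110., 16.1,  8.],
    [582.,  90., 17.3,  6.],
])
y = np.array([4.0, 3.6, 3.8, 2.9, 3.7])  # hypothetical quality scores

# Add an intercept column and solve for the regression coefficients.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the quality of a new (equally hypothetical) vintage.
new_vintage = np.array([1., 550., 100., 17.0, 5.])  # leading 1 = intercept
print("predicted quality:", new_vintage @ coef)
```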

The first chapter, ‘Who’s doing the thinking for you?’, dwells on recommendation engines such as those of Netflix, Amazon, Last.fm and Pandora (see the IEEE Spectrum article on the winning algorithm by the current leaders of the Netflix Prize). Recommendation engines and collaborative filtering have become almost the norm on the web today, with nearly every news site (most emailed, most read, most shared), shopping site (people who bought this also bought …), media site (recommendations for artists, songs, albums) and social networking site (such as LinkedIn and Facebook’s ‘People You May Know’ feature) having such features built in. A slightly different example in the book is that of Walmart, which uses answers to certain questions to filter out (non-conformist) candidates for certain job profiles, based on data gathered from current employees. Internet search giant Google recently began crunching data from employee reviews and promotion and pay histories in a mathematical formula which it says can identify which of its 20,000 employees are most likely to quit. Fraud detection is another area this chapter skims; you could write an entire book on data mining and fraud detection alone.
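
For the curious, the core of a “people who bought this also bought …” feature can be sketched in a few lines. Below is a toy item-based collaborative filter using cosine similarity; the users, items and ratings are all invented, and production systems (such as the Netflix Prize entries) are far more sophisticated.

```python
# Toy item-based collaborative filtering: rank items by how similarly they
# were rated by the same users. Users, items and ratings are invented.
import math

ratings = {  # user -> {item: rating}
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob":   {"book_a": 4, "book_b": 4, "book_c": 5},
    "carol": {"book_a": 1, "book_b": 5},
}

def similarity(item1, item2):
    """Cosine similarity between two items over users who rated both."""
    common = [u for u in ratings if item1 in ratings[u] and item2 in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][item1] * ratings[u][item2] for u in common)
    n1 = math.sqrt(sum(ratings[u][item1] ** 2 for u in common))
    n2 = math.sqrt(sum(ratings[u][item2] ** 2 for u in common))
    return dot / (n1 * n2)

def recommend_like(item):
    """Rank every other item by its similarity to the given item."""
    others = {i for user in ratings.values() for i in user if i != item}
    return sorted(others, key=lambda other: similarity(item, other), reverse=True)

print(recommend_like("book_a"))  # items most like book_a come first
```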

The second chapter, ‘Creating Your Own Data with a Flip of a Coin’, deals with testing hypotheses by running random trials to test the efficacy of a prediction, especially in cases where the cost of running the trial is low. Lots of companies are known to use this. For example, when Google or Yahoo change their front page or layout, different users see different versions of the page (often differing in a single aspect), and user behavior is then mapped and ranked against a whole set of criteria. This goes on for several iterations until the design is perfected. This technique is called A/B testing. The efficacy of a trial run can be improved further by using Taguchi methods. I found the author’s approach of using randomised trials to find answers similar to Nassim Taleb’s advocacy of ‘Monte Carlo methods’ for random sampling in ‘Fooled by Randomness’.
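
To make the A/B testing idea concrete, here is a minimal sketch of how the outcome of such a trial might be judged, using a standard two-proportion z-test. The traffic numbers are invented.

```python
# A minimal sketch of evaluating an A/B test: did variant B's click-through
# rate beat variant A's by more than chance? All numbers are invented.
import math

clicks_a, visitors_a = 120, 2400   # control page
clicks_b, visitors_b = 150, 2400   # variant page

p_a = clicks_a / visitors_a
p_b = clicks_b / visitors_b
p_pool = (clicks_a + clicks_b) / (visitors_a + visitors_b)

# Two-proportion z-test: standard error under the pooled null hypothesis
# that both pages have the same underlying click-through rate.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se

print(f"A: {p_a:.3%}  B: {p_b:.3%}  z = {z:.2f}")
# |z| > 1.96 means the difference is significant at the usual 5% level.
```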

The third chapter, ‘Government by chance’, elucidates how randomised trials are helping change and improve social policies. Examples include Mexico’s Progresa initiative and MIT’s Poverty Action Lab.

The chapter ‘How should physicians treat evidence-based medicine?’ shows how evidence-based medicine is profoundly changing medical practice. The most interesting story in the chapter is that of Ignaz Semmelweis, who used statistical techniques to figure out that the transmission of puerperal fever (a form of septicaemia) could be prevented by having doctors wash their hands in chlorinated solutions. Unfortunately he was ridiculed by many, eventually committed to an asylum and, in an ironic twist of fate, died of septicaemia only a fortnight later. All this happened before Louis Pasteur developed the germ theory of disease. The chapter also relates stories of how rule-based systems such as Isabel are helping avoid misdiagnosis and assisting doctors in treating patients faster.

The chapter ‘Experts Versus Equations’ expands on the underlying and subtle theme of the book: that supercrunching is slowly but surely defeating the experts in various fields (given accurate and adequate data) and that experts’ intuition is not always right. It goes on to explain how biases and emotions cloud our judgement and reduce the accuracy of our predictions. (A good aside at this point is reading up on positive confirmation bias, our tendency to search for data that confirms our perception, and cognitive dissonance, the uncomfortable feeling caused by holding two contradictory ideas simultaneously.) The example given is how a very simplistic algorithm outperformed legal experts in predicting how the US Supreme Court justices would vote.

In the chapter ‘Why Now?’, the author makes a compelling case for why supercrunching is becoming ever more relevant and accessible to ordinary people, thanks to the rise in data-crunching ability (Moore’s Law), the falling cost of storage for large datasets (Kryder’s Law) and the ease of distribution and verification (the Internet). The author also introduces neural networks in this chapter and relates the story of Epagogix, a firm that uses neural networks to improve a film’s commercial gross by tweaking movie scripts. Yes, unbelievable, but who knew. It gives a whole new meaning to formulaic films. :)

In the chapter ‘Are we having fun yet?’ the author touches upon the topic of education and how Direct Instruction is changing education in America. This was probably one of the most counterintuitive examples in the book. Direct Instruction replaces the discretion of the teacher with a behavioral script for teaching students. Teachers have resisted this method despite overwhelming evidence that it is the best way to teach students. Again the conflict between intuition and hard numbers crops up. Another theme in this chapter is how megacorps are using the data we generate (such as our buying decisions and browsing behavior) to make decisions that increase their profitability. A recent NYTimes article, ‘What Does Your Credit-Card Company Know About You?’, touches upon this topic as well.

The final chapter, ‘The Future of Intuition (and Expertise)’, explains some basic statistics in layman’s terms, such as the 2SD rule, Bayes’ theorem and the margin of error in statistical surveys. I wish this section were more extensive though.
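
Those closing ideas are easy to play with in code. The sketch below works through the 2SD rule (as a poll’s margin of error) and a classic Bayes’ theorem calculation; the poll and disease-test numbers are invented textbook-style examples, not figures from the book.

```python
# The 2SD rule and Bayes' theorem in a few lines. All numbers are invented.
import math

# Margin of error for a poll proportion: roughly 2 standard deviations
# covers ~95% of outcomes (the "2SD rule").
p, n = 0.52, 1000                      # 52% support in a 1000-person poll
moe = 2 * math.sqrt(p * (1 - p) / n)
print(f"52% +/- {moe:.1%}")            # ~ +/- 3.2%

# Bayes' theorem: P(disease | positive test) for a 1-in-1000 disease and a
# test with 99% sensitivity and a 5% false-positive rate.
prior = 0.001
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.05
p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
posterior = p_pos_given_disease * prior / p_pos
print(f"P(disease | positive) = {posterior:.1%}")   # ~1.9%, far below 99%
```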

Overall a good read (3/5), but I wish it had more examples and were more comprehensive. Ian Ayres also writes on the Freakonomics blog. I also recently read that IBM is working on software to analyze trends based on real-time data.

More @ Google Talk by Ian Ayres.

ACM Tech Talk – April 2009 – FreeBSD/PMCTools

Joseph Koshy is a FreeBSD developer who has been contributing to the project for more than a decade. He recently gave a talk on “FreeBSD and PMC Tools” as part of the ACM Bangalore Tech Talk series. I was aware of Koshy’s work due to my interest in both system-level performance monitoring (I gave a talk on this topic at Foss.in 2005) and FreeBSD. Having worked on porting the BSD userland as part of my earlier work on SFU at Microsoft, I had come across his name several times in FreeBSD release notes and changelogs. I was really surprised to hear from him that this was his first ever public lecture on FreeBSD/PMC tools.

Joseph started with an introduction to FreeBSD and the BSD philosophy (in short: don’t worry about ideology, just code :). He then explained how the architecture of PMC tools evolved and the challenges of writing such a tool. One of the challenges of performance counters is that their number and nature change with every generation of Intel/AMD chips, so some assumptions go for a toss. The code therefore has to be written with long-term portability in mind while keeping the UI stable (command line in this case; a GUI is in the works but not a top priority, so if you are interested in contributing to FOSS, this is a good opportunity: get in touch with Joseph). The PMC tools code is well written and is referenced in a “Communications of the ACM” (CACM) article as an example of beautiful, well-written code. His presentation from the ACM Tech Talk follows and can be downloaded here.

Cloud Camp Bangalore

This is a belated post, as I was really busy catching up with work after coming back from an injury.

CloudCamp Bangalore went off well at the end of March with good attendance. Of the 400 people who registered, more than 300 turned up. There were quite a few interesting discussions.

The format was a cross between a normal conference and an unconference: some of the sessions were planned beforehand rather than scheduled on the spot as in an unconference. The “UnPanel” format was a big hit. In the UnPanel, we asked who among the audience felt they were experts on a certain topic (such as security, networks, or storage) and invited them onstage to form the panel. We then asked the rest of the audience what questions they wanted answered during the conference, wrote those questions on the board, and got the unpanelists to answer them. This got the audience involved and spawned some animated conversations which continued over lunch.

There were four parallel sessions after lunch on a variety of topics related to cloud computing. Here are some presentations from those post-lunch sessions:

Snappyfingers

Snappyfingers is an FAQ search engine that uses Amazon’s S3, SQS and EC2 services to run its infrastructure. Chirayu Patel explains some of the lessons from building a search engine on cloud services in the following presentation.
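
As a purely hypothetical sketch (not Snappyfingers’ actual code), the S3 + SQS + EC2 pattern usually looks like this: a producer queues work in SQS, and workers on EC2 instances pull jobs and write results to S3. The sketch uses the boto library that was the standard Python interface to AWS at the time; the queue and bucket names are made up.

```python
# Hypothetical S3 + SQS work-queue pattern, sketched with the boto library.
# Queue and bucket names are invented; credentials come from the environment.
import boto
from boto.sqs.message import Message

sqs = boto.connect_sqs()
s3 = boto.connect_s3()

queue = sqs.create_queue('faq_crawl_jobs')   # idempotent: returns existing queue
bucket = s3.create_bucket('faq-pages')

# Producer: enqueue a URL to be crawled.
msg = Message()
msg.set_body('http://example.com/faq.html')
queue.write(msg)

# Worker (running on an EC2 instance): pull a job, process it, store the result.
job = queue.read()
if job is not None:
    url = job.get_body()
    key = bucket.new_key(url.replace('/', '_'))
    key.set_contents_from_string('<fetched page body would go here>')
    queue.delete_message(job)   # delete only after the work is safely stored
```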

ACK Media

ACK Media is also using the S3 and EC2 cloud services, in its case to build an MMORPG based on the Indian epic Mahabharata. Arjun Gupte from ACK Media explains in the following presentation why he used cloud computing to build the game, and describes the game’s architecture.

Akamai State of the Internet Report – Q4 2008


Akamai recently released its “State of the Internet” report for Q4 2008, which contains information about attack traffic (as measured from more than 193 countries and targeting more than 20,000 ports), significant Internet events, and Internet penetration and broadband adoption. Akamai has a good view of what is happening on the Internet, as its distributed network has more than 42,000 servers deployed in almost all major ISPs across the world. Some highlights from the report:

  • The US and China were the top two countries from which attack traffic originated.
  • Attack traffic spikes on the so-called “Attack Wednesday”, which typically follows Microsoft’s “Patch Tuesday”, when Microsoft releases patches fixing vulnerabilities in its software.
  • There is a push towards deployment of the DNSSEC protocol after the DNS vulnerability uncovered by Dan Kaminsky was made public.
  • Fixed broadband adoption and capacity are increasing in several countries, including Switzerland, Brazil, Italy, Spain, Canada and India.
  • Nordic Countries (Norway, Sweden, Finland, Denmark and Iceland) continue to lead the world in internet penetration.

The “State of the Internet” report contains lots of interesting information about security, Internet penetration and technology adoption. Do check the endnotes for further links and references for the information in the report. You can download the report here.

There are similar reports by other organisations as well.
