Archive for the ‘Technology’ Category

Cloudcamp Bangalore 2010 and Hadoop Summit

The 2nd CloudCamp Bangalore was held at Dayanand Sagar College of Engineering. It was co-located with the first Hadoop Summit in India. The Hadoop Summit was interesting and more relevant to me, as I am using a Hadoop cluster for analytics at Inmobi. Dave kicked off CloudCamp with the signature “unPanel”. I was on the unPanel this time and answered some questions on mobiles, netbooks and smartphones as access devices for the cloud, and on the impact of Google’s patent on MapReduce.

The corridor discussions with a bunch of Hadoop committers were insightful. I also found out more about Mahout. Mahout is an Apache project to build scalable machine learning libraries. It is not restricted to Hadoop implementations, but much of the current activity seems to be around Hadoop.

Notes and embedded slides from the sessions I attended follow:

Hadoop summit Keynote

Data Management on Grid

Notes:

  • Y! uses an HDFS replication factor of 3 (the Hadoop default) in most cases. Exceptions are big clusters with a large number of applications running simultaneously.
  • Y! does not use Avro yet due to the large amount of legacy data. Twitter uses Avro.
  • Data ingestion layer uses MapReduce for heavy lifting and format conversion for storage.
  • LZO is used for compression. gzip (not ideal because it lacks block-level indexing) and bzip2 are also used. bzip2 decompression is slow, but bzip2 delivers better compression ratios.
  • Data ingestion layer also oversees policy for data retention and purging.
  • The underlying filesystem is rarely a bottleneck for Hadoop. Mostly, the synchronization semantics of HDFS are the bottleneck: a file operation is not successful until all the replicas are in sync.
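
The compression trade-off in the notes above (ratio versus speed) can be measured on your own data. The following is only a minimal sketch using Python's standard-library gzip and bz2 modules. LZO is left out because it is not in the standard library, and the function name `compare_codecs` and the sample log line are hypothetical, made up for illustration. It does not model Hadoop's codecs or splittability; it just reports sizes and timings for one in-memory buffer.

```python
import bz2
import gzip
import time


def compare_codecs(data: bytes) -> dict:
    """Compress and decompress `data` with gzip and bzip2, reporting size and timings."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    report = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        report[name] = {
            "compressed_bytes": len(packed),
            "ratio": len(data) / len(packed),
            "compress_seconds": middle - start,
            "decompress_seconds": end - middle,
            "round_trip_ok": restored == data,
        }
    return report


if __name__ == "__main__":
    # Hypothetical, highly repetitive sample; real logs will compress differently.
    sample = b"2010-02-28 10:15:02 INFO impression served ad=42 country=IN\n" * 20000
    for codec, stats in compare_codecs(sample).items():
        print(codec, stats)
```

The numbers depend heavily on the input, so it is worth running this against a representative slice of real data before drawing conclusions.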

Machine Learning using Hadoop

Notes:

  • There are clear differences between data mining and machine learning.
  • ML is harder to implement efficiently on Hadoop. Improving efficiency is still a research problem.
  • Hadoop creates one map task per block, which produces too many empty files and also many reducers.

Optimizing and Benchmarking Hadoop

Notes:

  • As a rule of thumb, adding as much memory as money can buy is a good idea for Hadoop.
  • Consider network connectivity, as the shuffle stage does heavy network I/O.
  • Solid state disks might make sense at certain price/performance ratios. They are also more power efficient.

Tuning Hadoop To Deliver Performance To Your Application

Notes:

  • There are several parameters for tuning Hadoop, but they must be used in conjunction with each other.
  • Set the number of map tasks slightly higher than the number of cores to ensure better utilization. This makes sure that data is processed in waves. It also gives better network utilization (as the shuffle phase happens in parallel with the map phase) along with better CPU scheduling.
  • Choosing a good HDFS block size is important. The number of HDFS blocks is directly proportional to the number of map tasks generated.
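
The relationship between block size, map tasks and waves in the notes above can be worked through with some simple arithmetic. This is a rough sketch, assuming one map task per HDFS block (a simplification that ignores small files and custom split sizes) and a fixed number of concurrent map slots. The function names and the 48-slot cluster are hypothetical, chosen only for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def estimate_map_tasks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Estimate map tasks for one file, assuming one task per HDFS block."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    return ceil_div(file_size_bytes, block_size_bytes)


def estimate_waves(num_tasks: int, map_slots: int) -> int:
    """Estimate how many waves are needed to run all tasks on the available slots."""
    if map_slots <= 0:
        raise ValueError("map_slots must be > 0")
    return ceil_div(num_tasks, map_slots)


if __name__ == "__main__":
    MIB = 1024 * 1024
    GIB = 1024 * MIB
    slots = 48  # hypothetical cluster-wide map slot count
    for block_mib in (64, 128):
        tasks = estimate_map_tasks(10 * GIB, block_mib * MIB)
        print(f"{block_mib} MiB blocks: {tasks} map tasks, "
              f"{estimate_waves(tasks, slots)} waves on {slots} slots")
```

For a 10 GiB input, doubling the block size from 64 MiB to 128 MiB halves the estimated map tasks from 160 to 80, which is the proportionality the last bullet describes.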

Links to all presentations

ACM Compute 2010 and ACM India launch

ACM Compute 2010 concluded yesterday. It is the flagship conference of the ACM Bangalore chapter. This year was the 3rd edition of the conference, and more than 500 people attended. The highlight of this year’s conference was the launch of ACM India. ACM wants to increase its reach in India, and the ACM India Council, consisting of 18 leading computer scientists from academia and industry, is heading this initiative.

The ACM India launch was addressed by three Turing Award winners: Barbara Liskov, C.A.R. Hoare (Tony Hoare) and Raj Reddy. The ACM Turing Award is often called “the Nobel Prize of computing”, and it is rare to see three Turing Award winners address the audience at any event. Barbara Liskov is the most recent recipient of the Turing Award (the 2nd woman to win it) and she spoke on the power of abstraction. She spoke about the problems early programmers faced when writing large and complex programs, and explained how she tried to solve them using abstractions similar to what is now called object-oriented programming. She talked at length about how her insights and experiences with these programming problems led to the design of the CLU language. CLU was among the first languages to implement iterators (a precursor of today’s generators) as well as exception handling. Listening to her was a good lesson in computer history. I learned later that she was one of the first women in the United States to earn a PhD from a computer science department. (Her doctoral advisor was the legendary John McCarthy.) Her presentation and the references mentioned in it make for good reading.

Dr Raj Reddy is the only Indian to have won the Turing Award, which he received for his contributions to the field of Artificial Intelligence. Incidentally, his PhD advisor was also John McCarthy – AI pioneer and Turing Award winner. Dr Raj Reddy spoke about the growth of computing over the years and the challenges of reaching the “bottom of the pyramid”. He explained why there is a need to move from the WIMP paradigm in user interfaces to the SILK (Speech, Image, Language and Knowledge) paradigm to increase the reach of computing. His Turing Award lecture (“To Dream the Possible Dream”) makes for an interesting read as well.

C.A.R. Hoare (Tony Hoare) was the next speaker. He is a living legend in computer science. I was looking forward to hearing him speak, as I had studied the Quicksort algorithm (which he invented) and the Communicating Sequential Processes paper in college. He was remarkably witty and his enthusiasm for computer science shone through in his talk. In particular, he spoke about the Verified Software Initiative, which he contended was similar in scope and impact (for computer science) to the Hubble Telescope and the Human Genome Project.

Over the following two days, we had the ACM Compute 2010 conference, with several hands-on tutorials on cloud computing, rich Internet applications and Web 2.0 apps, widgets and mobile applications. The RIA tutorial was conducted by Mrinal Wadhwa (slides embedded below) and the Facebook Connect tutorial by Prateek Dayal (of Muziboo).

(Disclosure: I am the secretary of the Bangalore Chapter and am on the program committee for ACM Compute 2010.)

Coders At Work Review

Once in a while, you read a book that is filled with ‘aha’ moments. If you have written complex software for a while or want to become a good programmer, then ‘Coders at Work’ is a must-read. This fantastic book interviews 15 master programmers. Some of the people interviewed in the book are well-known names such as Don Knuth, Ken Thompson, Jamie Zawinski and Peter Norvig.

Some comments on the content of the book:

Programming languages
Many of the programmers interviewed started with BASIC and considered it an okay language. What is probably more surprising is the universal hatred of C++ in this group. In fact, several people such as Peter Norvig and Ken Thompson (who goes on a tirade against C++) consider it a downright ugly and cumbersome language to work with.

Jamie Zawinski – C++ is just an abomination
Brad Fitzpatrick – The syntax is terrible and totally inconsistent and the error messages, at least from GCC, are ridiculous.
Ken Thompson – By and large I think it’s a bad language. It does a lot of things half well and it’s just a garbage heap of ideas that are mutually exclusive. Everybody I know, whether it’s personal or corporate, selects a subset and these subsets are different. So it’s not a good language to transport an algorithm—to say, “I wrote it; here, take it.” It’s way too big, way too complex. And it’s obviously built by a committee.

On Programming and Curiosity
Almost everyone interviewed still programs (some only occasionally) and enjoys hacking and taking things apart. Many were misfits and took unusual career paths to get to where they are today. There is a rebel and hacker streak in all of them. Most of them stumbled into programming and discovered at some point that they were good at it. Everyone emphasized the practice of writing good, readable code. Everyone laments that you can no longer understand a system from the bottom up, as systems have become more and more complex and layers of abstraction have multiplied.

On categorizing programming and building software
Opinion is pretty much evenly split on whether programming is a science, an art, a craft or engineering, with a slight bias towards craftsmanship.

On Recommended Books
Among the books recommended, “The Art of Computer Programming” by Don Knuth topped the list for obvious reasons. Another book that was recommended by several people was “The Psychology of Computer Programming” by Gerald Weinberg.

On the state of computer science
The mood on the state of developments in computer science was fairly pessimistic, and most people pointed to the fact that many of the breakthrough ideas in computer science were conceived in the ’70s (with the notable exception of the internet and web programming).

The only downside here is the interview with Fran Allen. It should not have made the book. I got the distinct feeling that much of the work that she claimed credit for was implemented by others and that she was the manager of those projects (probably a good one, but that is hardly the same as being a good programmer).

I have added some notes (for further reading) and quotes from the book on the wiki.

Tweetup with Alexis Ohanian – Reddit Cofounder

Alexis Ohanian (kn0thing on Twitter), the co-founder of reddit (and the creator of the beloved Reddit Alien), was in Mysore for the TED conference. He took a break from the conference to meet up with a bunch of redditors. For those who don’t know, he is also the publisher of the XKCD books, and all the proceeds from the book go to building a school in Laos. It was interesting talking to him about startups, Startup School, Paul Graham, Y Combinator, traveling in India, the startup scene in India, social media [link to TED Presentation] and of course reddit.

We gave him a sampling of Indian food (Coconut Groove) and sweets (K C Das). Thanks to @dhempe and @pswam for organising this tweetup.

The James Bond of Datacenters

Conference room

It all started with a faint memory, given the increasingly transient nature of the Internet. I vaguely remembered reading about a datacenter that was based somewhere near Stockholm and was built inside a (decommissioned) nuclear bunker. A little bit of googling later, I realised that I had read about it on the Pingdom site. I thought to myself that since this is a conference where networking geeks converge, there is a good chance that some attendee knows the people at the Bahnhof ISP. So I sent this mail. Initially no one replied, but then the number of people interested in seeing the datacenter just ballooned. Two people stepped forward to arrange the visit during the lunch break at the conference, and the CEO of Bahnhof, Jon Karlung, personally guided us through this fantastic datacenter.

The entrance to the Datacenter

There is actually a house on top of that hill, and there is a pathway from the datacenter that opens up onto the top of the hill. The datacenter is dug out of hard rock (granite), as can be clearly seen from the pictures of the server floor at the end of the post. It was originally a military bunker and nuclear shelter during the Cold War era. The code name from its military days, Pionen White Mountains, can be seen in the photos of the entrance to the datacenter.

Entrance to the White Mountain DC Entrance

The Backup Generators

Two Maybach MTU diesel engines, which produce 1.5 megawatts of AC power, provide backup power. The engines were originally designed for German submarines. There is a warning horn from a German submarine that adds to the effect :)

The Backup power room German Submarines Engine

The Conference Room

The conference room and the pathway leading to it are made completely out of metal and glass and hang above the server floor, adding to the futuristic space station look of the datacenter (as can be seen from the picture at the top of the post). There is also a Tintin-themed rocket [ See Destination Moon for the Tintin Reference ]

The Tintin Rocket View from the bridge

The Fountains

There are lots of plants around the datacenter to reduce the claustrophobic feel of the bunker and make it feel like a more natural working environment. The fountains at the entry are also part of the decor, but they are generally switched off as they make a lot of noise.

Another view of the Fountain The Fountain

The Netops and the Leisure room

This is Bahnhof’s biggest facility in Sweden, and this network operations room is used for running the ISP. The leisure room has a huge fish tank to add to the natural feel of the place.

The LCDs in the NetOps room The NetOps room
The CEO and the DC manager The Leisure room

The Server Floor

Some of the walls of the server floor are unadorned and are made of bare rock, giving away the initial use of the facility as a Cold War era nuclear bunker.

Servers The Servers

Bahnhof uses the unique nature of the datacenter for marketing purposes. It is actually possible to co-locate your servers here. The Pionen datacenter gives a whole new meaning to disaster recovery backup. :)