Archive for March 16th, 2010

CloudCamp Bangalore 2010 and Hadoop Summit

The 2nd CloudCamp Bangalore was held at Dayanand Sagar College of Engineering. It was co-located with the first Hadoop Summit in India. The Hadoop Summit was interesting and more relevant to me, as I am using a Hadoop cluster for analytics at Inmobi. Dave kicked off CloudCamp with the signature “unPanel”. I was on the unPanel this time and answered some questions on mobiles, netbooks and smartphones as access devices for the cloud, and on the impact of the Google patent on MapReduce.

The corridor discussions with a bunch of Hadoop committers were insightful. I also found out more about Mahout. Mahout is an Apache project to build scalable machine learning libraries. It is not restricted to Hadoop implementations, but much of the current activity seems to be around Hadoop.

Notes and embedded slides from the sessions I attended follow:

Hadoop Summit Keynote

Data Management on Grid

Notes:

  • Y! uses an HDFS replication factor of 3 (the Hadoop default) in most cases. Exceptions are big clusters with a large number of applications running simultaneously.
  • Y! does not use Avro yet due to the large amount of legacy data. Twitter uses Avro.
  • Data ingestion layer uses MapReduce for heavy lifting and format conversion for storage.
  • LZO is used for compression. gzip (not ideal due to non-block-level indexing) and bzip2 are also used. bzip2 decompression is slow, but bzip2 delivers better compression ratios.
  • Data ingestion layer also oversees policy for data retention and purging.
  • The underlying filesystem is rarely a bottleneck for Hadoop. Mostly the synchronization semantics of HDFS are the bottleneck: a file operation is not successful until all the replicas are in sync.
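
To get a feel for the gzip versus bzip2 trade-off mentioned above, here is a minimal sketch using only Python's standard library. It is not what Y! runs: the sample data and the `compare_codecs` helper are made up for illustration, LZO is left out because it is not in the standard library, and the numbers will vary with the data and the machine.

```python
import bz2
import gzip
import time


def compare_codecs(data: bytes) -> dict:
    """Compress and decompress `data` with gzip and bzip2.

    Returns, per codec, the compression ratio (original size / compressed
    size) and the wall-clock time spent decompressing.
    """
    results = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        compressed = codec.compress(data)
        start = time.perf_counter()
        restored = codec.decompress(compressed)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must give back the original bytes.
        if restored != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "ratio": len(data) / len(compressed),
            "decompress_seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    # Hypothetical log-like sample; real ingestion data will behave differently.
    sample = b"2010-03-16 INFO request served in 42 ms\n" * 50_000
    for codec_name, stats in compare_codecs(sample).items():
        print(codec_name, round(stats["ratio"], 1), stats["decompress_seconds"])
```

Running it on a sample of your own data is a cheap way to see whether the extra compression is worth the slower decompression for your workload.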

Machine Learning using Hadoop

Notes:

  • There are clear differences between data mining and machine learning.
  • ML is harder to implement efficiently on Hadoop. Improving efficiency is still a research problem.
  • Hadoop creates one map task per block, which can create too many empty files and also many reducers.

Optimizing and Benchmarking Hadoop

Notes:

  • As a rule of thumb, adding as much memory as money can buy is a good idea for Hadoop.
  • Consider network connectivity, as the shuffle stage does heavy network I/O.
  • Solid state disks might make sense at certain price/performance ratios. They are also more power efficient.

Tuning Hadoop To Deliver Performance To Your Application

Notes:

  • There are several parameters to tune Hadoop, but they must be tuned in conjunction with each other.
  • Set the number of map tasks slightly higher than the number of cores to ensure better utilization. This makes sure that data is processed in waves. It also gives better network utilization (as the shuffle phase happens in parallel with the map phase) along with CPU scheduling.
  • Choosing a good HDFS block size is important. The number of HDFS blocks is directly proportional to the number of map tasks generated.
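
The relationship between block size, map tasks and waves can be sketched with some simple arithmetic. This is a rough illustration, assuming one map task per HDFS block and a fixed number of map tasks running concurrently across the cluster; the function names and the example figures (a 10 GiB input, 48 concurrent map tasks) are made up, and real jobs also depend on the input format and split settings.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return (numerator + denominator - 1) // denominator


def map_task_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of map tasks, assuming one task per HDFS block."""
    return ceil_div(file_size_bytes, block_size_bytes)


def wave_count(map_tasks: int, concurrent_map_tasks: int) -> int:
    """Number of waves needed to run all map tasks."""
    return ceil_div(map_tasks, concurrent_map_tasks)


if __name__ == "__main__":
    MIB = 1024 * 1024
    input_size = 10 * 1024 * MIB   # hypothetical 10 GiB input file
    cluster_capacity = 48          # hypothetical concurrent map tasks

    for block_size in (64 * MIB, 128 * MIB):
        tasks = map_task_count(input_size, block_size)
        waves = wave_count(tasks, cluster_capacity)
        print(block_size // MIB, "MiB blocks:", tasks, "map tasks,", waves, "waves")
```

Doubling the block size halves the number of map tasks for the same input, which in turn changes how many waves the job needs on a given cluster.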

Links to all presentations
