Cloudcamp Bangalore 2010 and Hadoop Summit
The 2nd CloudCamp Bangalore was held at Dayanand sagar College of Engineering. It was co-located with the First Hadoop summit in India. The Hadoop summit was interesting and more relevant to me as I am using a Hadoop cluster for Analytics at Inmobi. Dave kicked off Cloudcamp with signature “unPanel”. I was on the Unpanel this time and answered some questions on mobiles, netbooks and smartphones as access devices for the cloud and the on impact of Google patent on MapReduce.
The corridor discussions with a bunch of Hadoop committers were insightful. I also found out more about Mahout. Mahout is a Apache project to build scalable machine learning libraries. It is not restricted to Hadoop implementations, but much of the current activity seems to be around Hadoop.
Notes and embedded slides from the sessions I attended follow:
Hadoop summit Keynote
Data Management on Grid
Notes:
- Y! uses a HDFS replication factor of 3 (the hadoop default) in most cases. Exceptions are big clusters with large number of applications running simultaneously.
- Y! does not use Avro yet due to large amount of legacy data. Twitter uses Avro.
- Data ingestion layer uses MapReduce for heavy lifting and format conversion for storage.
- LZO is used for compression. gzip (not ideal due to non-block-level indexing) and bzip2 is also used. There are problems with slowness of bzip2 decompression but bzip2 delivers better compression ratios.
- Data ingestion layer also oversees policy for data retention and purging.
- Underlying filesystems is rarely a bottleneck for Hadoop. Mostly the synchronization semantics of HDFS is a bottleneck. A file operation is not successful until all the replicas are in sync.
Machine Learning using Hadoop
Notes:
- There are clear differences between data mining and machine learning.
- ML is harder to implement efficiently on Hadoop. Improving efficiency is still a research problem.
- Hadoop creates one map job / block creating too many empty files and also many reducers.
Optimizing and Benchmarking Hadoop
Notes:
- As a thumb rule, adding as much memory as money can buy is a a good idea for Hadoop
- Consider Network connections as shuffle stage does heavy network I/O
- Solid state disks might make sense at certain price/performance ratios. They are also more power efficient.
Tuning Hadoop To Deliver Performance To Your Application
Notes:
- Several parameters to tune Hadoop but must be used in conjunction with each other.
- Set number of map jobs slightly more than number of cores to ensure better utilization. Makes sure that data is processed in waves. Also better network utilization (as shuffle phase happens parallely with Map phase) along with CPU scheduling
- Choosing a good HDFS block size is important. Number of HDFS blocks is directly proportional to number of Map tasks generated
Links to all presentations
- Hadoop Summit Keynote – USage inside Yahoo and Statistics
- Marketing presentation – Yahoo’s committment to Hadoop and Opensource
- Data Management on Grid
- Machine Learning using Hadoop
- Multiple Sequence Alignment Using Hadoop
- Optimizing and Benchmarking Hadoop
- Challenges And Uniqueness Of Quality Engg And Release Engg Processes In Hadoop
- Tuning Hadoop To Deliver Performance To Your Application

