Published by Vinayak Hegde on 16th March 2010
The 2nd CloudCamp Bangalore was held at Dayananda Sagar College of Engineering. It was co-located with the first Hadoop summit in India. The Hadoop summit was interesting and more relevant to me as I am using a Hadoop cluster for analytics at Inmobi. Dave kicked off CloudCamp with the signature “unPanel”. I was on the unPanel this time and answered some questions on mobiles, netbooks and smartphones as access devices for the cloud, and on the impact of Google's patent on MapReduce.
The corridor discussions with a bunch of Hadoop committers were insightful. I also found out more about Mahout. Mahout is an Apache project to build scalable machine learning libraries. It is not restricted to Hadoop implementations, but much of the current activity seems to be around Hadoop.
Notes and embedded slides from the sessions I attended follow:
Hadoop summit Keynote
Data Management on Grid
Notes:
- Y! uses an HDFS replication factor of 3 (the Hadoop default) in most cases. The exceptions are big clusters with a large number of applications running simultaneously.
- Y! does not use Avro yet due to the large amount of legacy data. Twitter uses Avro.
- The data ingestion layer uses MapReduce for heavy lifting and for format conversion before storage.
- LZO is used for compression. gzip (not ideal because it lacks block-level indexing, so files cannot be split) and bzip2 are also used. bzip2 is slow to decompress, but it delivers better compression ratios.
- The data ingestion layer also oversees policy for data retention and purging.
- The underlying filesystem is rarely a bottleneck for Hadoop. More often the synchronization semantics of HDFS are the bottleneck: a file operation is not successful until all the replicas are in sync.
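The compression trade-off in the notes above can be tried out on a small scale. The following is only a rough sketch using Python's standard `gzip` and `bz2` modules; LZO is left out because it is not in the standard library, and the sample data and the `compare_codecs` helper are my own assumptions, not anything shown in the talk. Actual ratios and timings depend heavily on the data and the machine.

```python
import bz2
import gzip
import time


def compare_codecs(payload: bytes) -> dict:
    """Return compression ratio and decompression time for gzip and bzip2."""
    results = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        compressed = codec.compress(payload)
        start = time.perf_counter()
        restored = codec.decompress(compressed)
        elapsed = time.perf_counter() - start
        if restored != payload:
            raise ValueError(f"{name} round trip was not lossless")
        results[name] = {
            "ratio": len(payload) / len(compressed),
            "decompress_seconds": elapsed,
        }
    return results


# Synthetic, log-like sample data (an assumption, not data from the talk).
sample = b"".join(
    b"2010-03-16 host%03d GET /page/%d 200\n" % (i % 50, i)
    for i in range(20000)
)

if __name__ == "__main__":
    for name, stats in compare_codecs(sample).items():
        print(
            f"{name}: ratio={stats['ratio']:.1f}, "
            f"decompress={stats['decompress_seconds'] * 1000:.1f} ms"
        )
```

Running it on a sample of your own data gives a feel for whether the extra compression is worth the decompression cost.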
Machine Learning using Hadoop
Notes:
- There are clear differences between data mining and machine learning.
- ML is harder to implement efficiently on Hadoop. Improving efficiency is still a research problem.
- Hadoop creates one map task per block, which can create too many empty files and also many reducers.
Optimizing and Benchmarking Hadoop
Notes:
- As a rule of thumb, adding as much memory as money can buy is a good idea for Hadoop.
- Consider network connectivity, as the shuffle stage does heavy network I/O.
- Solid state disks might make sense at certain price/performance ratios. They are also more power efficient.
Tuning Hadoop To Deliver Performance To Your Application
Notes:
- There are several parameters for tuning Hadoop, but they must be tuned in conjunction with each other.
- Set the number of map tasks slightly higher than the number of cores to ensure better utilization. This makes sure that data is processed in waves. It also gives better network utilization (as the shuffle phase runs in parallel with the map phase) along with better CPU scheduling.
- Choosing a good HDFS block size is important. The number of map tasks generated is directly proportional to the number of HDFS blocks.
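The relationship between block size, map tasks and waves is simple arithmetic. Here is a minimal sketch, assuming one map task per HDFS block and a fixed number of map slots across the cluster; the function name `map_task_plan` and the figures used are hypothetical, chosen only for illustration.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point rounding."""
    return (a + b - 1) // b


def map_task_plan(input_bytes: int, block_bytes: int, map_slots: int) -> dict:
    """Estimate map tasks and waves, assuming one map task per HDFS block."""
    map_tasks = ceil_div(input_bytes, block_bytes)
    waves = ceil_div(map_tasks, map_slots)
    return {"map_tasks": map_tasks, "waves": waves}


if __name__ == "__main__":
    # Hypothetical job: 10 GiB of input on a cluster with 48 map slots.
    for block in (64 * MiB, 128 * MiB):
        plan = map_task_plan(10 * GiB, block, map_slots=48)
        print(f"block={block // MiB} MiB -> {plan}")
```

Under these assumptions, doubling the block size from 64 MiB to 128 MiB halves the number of map tasks (160 to 80) and takes the job from four waves to two.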
Links to all presentations
Published by Vinayak Hegde on 22nd September 2009
With winter approaching fast, we are also getting into conference season.
ACM Compute 2010 – 22nd & 23rd Jan, 2010.
The ACM Bangalore chapter is organising ACM Compute 2010, which is now in its third year. This year the broad theme is Cloud computing and Information retrieval, management and analytics. The aim of this conference is to bring together researchers, practitioners, technology market movers, and thought leaders, with a view to advancing the state of the art and the state of the practice in applied research. This year we are planning to do something special – details soon.
The Call for Papers (CFP) has been out for some time now, and the last date for submissions is Oct 1, 2009. You can also submit a proposal for a half-day or day-long tutorial. Last year we had a bunch of good tutorials, and the symposium on Cloud Computing co-located with ACM Compute 2009 was a great success.
Disclaimer: I am on the program committee of ACM Bangalore and am the secretary of the ACM Bangalore chapter.
Pycon India 2009 – 26th & 27th Sep 2009
Also this weekend (26th and 27th September 2009), India’s first Pycon, Pycon India 2009, is being held in Bangalore. There is an interesting list of talks lined up, so do register if you are interested in attending.
Foss.in – December 1-5, 2009.
Foss.in shifts to a new venue this year: the NIMHANS convention centre. This year promises to be interesting as the venue is available for longer durations. There are also going to be hacker evenings/nights where tinkerers can meet up and talk about a whole range of stuff not restricted to just FOSS. My educated guess is something along the lines of CCC in Germany. Definitely something to look forward to. Plus I think there will be at least one evening where we will have music. So join the mailing list if you are interested in presenting/attending, as more details should emerge soon.
Published by Vinayak Hegde on 13th May 2009
This is a belated post, as I was really busy catching up with work after coming back from an injury.
Cloudcamp Bangalore went off well at the end of March with good attendance. Out of the 400 people who registered, more than 300 turned up. There were quite a few interesting discussions.
The format was a cross between a normal conference and an unconference. Some of the sessions were planned beforehand (not scheduled on the spot as in an unconference). The “UnPanel” format was a big hit. In the UnPanel, we asked who in the audience felt they were experts on a certain topic (such as security, networks, storage etc.) and invited them onstage to form the panel. Then we asked the rest of the audience what questions they expected to be answered during the conference. We wrote these questions down on the board and got the unpanelists to answer them. This got the audience involved and spawned some animated conversations which continued over lunch.
There were 4 parallel sessions after lunch on a variety of topics related to Cloud Computing. Here are some presentations from those post-lunch sessions:
Snappyfingers
Snappyfingers is an FAQ search engine. It uses the S3, SQS and EC2 services from Amazon to run its infrastructure. Chirayu Patel explains some of the lessons from building a search engine using Cloud services in the following presentation.
ACK Media
ACK Media is also using the S3 and EC2 Cloud services to build an MMORPG based on the Indian epic Mahabharata. Arjun Gupte from ACK Media explains why he used cloud computing to build the game, and the architecture of the game, in the following presentation.
Published by Vinayak Hegde on 13th March 2009
The first CloudCamp in India is happening on Sunday, March 29th in Bangalore, India. ACM Bangalore is supporting this CloudCamp.
CloudCamp is an unconference where early adopters of Cloud Computing technologies exchange ideas. With the rapid change occurring in the industry, we need a place we can meet to share our experiences, challenges and solutions. At CloudCamp, you are encouraged to share your thoughts in several open discussions, as we strive for the advancement of Cloud Computing. End users, IT professionals and vendors are all encouraged to participate.
One track will feature invited speakers from early adopter startups, Cloud Computing vendors and developers. Register below.

If you are interested in sponsoring or submitting a proposal for a talk, send an email to vinayakh AT gmail DOT com.
Some of the proposed sessions are “An introduction to Cloud Computing” by Dave Nielsen, “How to build a Search Engine using AWS” by Chirayu Patel and “How to use Cloud computing to build a MMORPG” by Arjun Gupte.
Tentative Agenda:
Doors open at 10:00am
10:00am: Talks (2 startups on their experience of using clouds)
11:00am: Expert Talk (a survey of cloud platform products – Amazon, Sun, Google, etc)
11:30am: Expert talk (case studies of app architectures that use clouds)
12:15pm: Break for unconference
12:30pm: Unconference session 1
1:30pm: Lunch
2:30pm: Unconference session 2
3:30pm: Unconference session 3
4:30pm: Quick snacks
5:00pm – 6:00pm: Panel discussion and close
Published by Vinayak Hegde on 18th August 2008
Open Source developers contribute code for a variety of reasons:
1. To scratch an itch – fulfill a particular current need and release it to the world hoping that someone finds it useful.
2. To contribute back to a community they are part of or that has helped them in the past.
3. Some contribute because they enjoy writing code and feel altruistic about helping the world.
4. Some release code to help it get widespread adoption (a marketing strategy used by companies) so they can charge for premium support and build a community of committed contributors.
These varied motivations are visible in the multiple licenses in the Open Source community. The most popular licenses are the GNU GPL, GNU LGPL, BSD, MIT and Apache licenses. The relationship between these open source software (OSS) licenses is illustrated below:
[Figure: relationship between OSS licenses]
Attribution: David Wheeler [http://www.dwheeler.com/essays/floss-license-slide.html]
Software was earlier delivered on floppies, then on CD-ROMs, and now via downloads. Enter the internet and the web. They have completely changed the way people work and communicate. The web moved from static HTML pages to Web 2.0. Software is increasingly moving away from the delivery model to the “hosted” model. Computing resource acquisition is moving from buying to renting. This has given rise to new paradigms of delivering software such as Software-as-a-Service (SaaS) and Cloud Computing.
Open source software has been one of the key enablers of this new revolution, along with open standards (HTML, HTTP, CSS, XML etc.), whether it is Linux or Apache or Firefox or Python. As mentioned above, the GPL is by far the most popular Open Source licence. When the GPL was written, the modes of software delivery were either through physical media or by downloading from an FTP server (GPL v2 was written in 1991 when the web was in its infancy). The GPL has a strong copyleft clause (called tit-for-tat by Linus Torvalds) which was crucial to the success of Linux, GCC and MySQL, three of the building blocks of much of the SaaS and Cloud Computing infrastructure. It has given the programmers who contributed to them the confidence that their work would benefit the whole world and remain free for distribution, rather than being exploited by software companies that would not have to give anything back to the community. This ethos is central to the motivation of many of the programmers who contribute to open source software (OSS).
However, the GPL has some “loopholes” which Application Service Providers (ASPs) exploit. Since the distribution clauses of GPL v2 (and now GPL v3) do not govern software whose functionality is accessed over a network (mostly the Internet), ASPs and SaaS companies were able to make changes to OSS and not give them back to the community. The license that fixed this loophole was the Affero GPL v3. It has a clause that governs the usage of software over a network:
| 13. Remote Network Interaction; Use with the GNU General Public License.
Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.
Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License. |
This clause is important to all Cloud Computing and SaaS vendors, as the source of any modifications they make to software licensed under the Affero GPL will have to be offered, at no charge, to the users who interact with that software over a network. This has made at least a few vendors unhappy.