Why Hadoop on the Cloud Makes Sense

Over the past year, there have been a great number of new product and project announcements related to running Hadoop in cloud environments. While Amazon's Elastic MapReduce continued to improve on its base platform, players like Qubole, Mirantis, VMware, and Rackspace all announced services or product offerings connecting the elephant with the cloud. Projects like Apache Whirr and Netflix's Genie began getting noticed. Recent announcements at the Hadoop World + Strata conference prompted experts to claim that Hadoop is taking over the cloud.

It has become clear that Hadoop in the cloud is a trending topic. This post explores six of the reasons why this combination makes sense, and why customers are seeing increased value in this model.

1. Lowering the Cost of Development

Running Hadoop on the cloud makes sense for the same reasons as running any other software offering on the cloud. For companies still testing the waters with Hadoop, the low up-front investment in the cloud is a no-brainer. The cloud also makes sense for a quick, one-time use case involving big data computation. As early as 2007, The New York Times used the power of Amazon EC2 instances and Hadoop for just one day to perform a one-time conversion of TIFF files to PDFs in a digitisation effort. Procuring scalable compute resources on demand is appealing as well.

2. Acquiring Large-Scale Resources Quickly

The point above about quick resource procurement deserves some elaboration. Hadoop and the platforms that influenced it made the vision of linearly scalable storage and compute on commodity hardware a reality. Web giants like Google, who always operated at web scale, knew they would need to operate on ever more hardware resources. They invested in building this hardware themselves.

In the enterprise, though, this was not always an option. Hadoop adoption in some enterprises grew organically from a few departments running tens of nodes to a consolidated medium or large cluster with a few hundred nodes. Typically such clusters came to be managed by a 'data platform' team distinct from the infrastructure team, the latter being responsible for procurement and management of the hardware in the data centre. As the analytics demand within the business grew, so did the need to expand the capacity of the Hadoop clusters. The data platform teams began hitting a bottleneck of a different kind. While the software itself had proven it could scale linearly, the time it took for hardware to materialise in the cluster due to IT policies varied from several weeks to several months, stifling innovation and growth. It became evident that 'throwing more hardware at Hadoop' wasn't as easy or as fast as it should be.

The cloud, with its promise of immediate access to hardware resources, is very attractive to leaders who want to offer a platform that scales fast enough to meet growing business needs. For instance, adding 50 more machines to a data centre could take several weeks, whereas on a cloud platform like Amazon EMR they would become available in tens of minutes.

3. Managing Batch Workloads Successfully

Being a batch-oriented system, common usage patterns for Hadoop involve scheduled jobs processing new incoming data on a fixed, periodic basis. Businesses gathering activity data from devices or web server logs ingest this data into an analytics application on Hadoop once or a few times a day and extract insights from it. The load on the compute resources of a Hadoop cluster varies with the timing of these scheduled runs and the rate of incoming data. A fixed-capacity Hadoop cluster built on physical machines is always on whether it is used or not, consuming power, leased space, and so on, and incurring cost.

The cloud, with its pay-as-you-use model, is a more efficient way to handle such batch workloads. Given predictability in the usage patterns, one can optimise even further by having clusters of the right size available at the right time for jobs to run. For the activity-data example cited above, companies can schedule cloud-based clusters to be available only for the period of the day when the data needs to be crunched.
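To make the cost argument concrete, here is a minimal sketch of the arithmetic; the node count, hourly rate, and batch window below are purely illustrative, not real pricing:

```python
# Hypothetical figures: a 20-node cluster at $0.50 per node-hour.
NODES = 20
RATE_PER_NODE_HOUR = 0.50  # USD, illustrative only

def monthly_cost(hours_per_day, days=30):
    """Cost of running the cluster for a given number of hours each day."""
    return NODES * RATE_PER_NODE_HOUR * hours_per_day * days

always_on = monthly_cost(24)    # fixed cluster, running around the clock
batch_window = monthly_cost(4)  # cloud cluster, up only for a 4-hour nightly run

print(always_on)     # 7200.0
print(batch_window)  # 1200.0, one sixth of the always-on cost
```

Even under these rough assumptions, a cluster that exists only for its daily batch window costs a fraction of an equivalent always-on cluster.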

4. Handling Variable Resource Requirements

Not all Hadoop jobs are created equal. While some of them require more compute resources, some need more memory, and others require a lot of I/O bandwidth. Typically, a physical Hadoop cluster is built of homogeneous machines, usually big enough to handle the largest job.

The default Hadoop scheduler has no provision for managing this diversity in Hadoop jobs, leading to sub-optimal results. For example, a job whose tasks need more memory than average might affect the tasks of other jobs running on the same slave node due to a drain on system resources. Advanced Hadoop schedulers like the Capacity Scheduler and the Fair Scheduler have tried to address the case of managing heterogeneous workloads on uniform resources using sophisticated scheduling techniques. For instance, the Capacity Scheduler supported the notion of 'High RAM' jobs, jobs whose tasks require more memory than usual. Such support is becoming more comprehensive with Apache YARN, where the notion of a resource is being more fully specified and handled. However, these solutions are still not as widely adopted as the stable Hadoop 1.0 options that lack this level of support.

Cloud offerings, meanwhile, already give the end user the choice of provisioning clusters with different types of machines for different kinds of workloads. Intuitively, this looks like a much simpler solution to the problem of managing variable resource requirements.

For instance, with Amazon Elastic MapReduce, you can launch a cluster with m2.xlarge machines if your Hadoop jobs need more memory, and c1.xlarge machines if your Hadoop jobs are compute-intensive.
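The mapping from workload profile to machine type can be captured in a tiny helper. This is only a sketch: the profile labels are hypothetical, and the instance types are the memory-optimised (m2) and compute-optimised (c1) families mentioned above; check the EMR documentation for the families available at launch time:

```python
# Illustrative mapping from a job's resource profile to an EC2 instance type.
# Profile names are made up for this example; instance types match the
# families discussed in the text.
INSTANCE_FOR_PROFILE = {
    "memory_heavy": "m2.xlarge",   # memory-optimised family
    "compute_heavy": "c1.xlarge",  # compute-optimised family
    "balanced": "m1.large",        # general-purpose default
}

def choose_instance(profile):
    """Pick an instance type for a job profile, falling back to 'balanced'."""
    return INSTANCE_FOR_PROFILE.get(profile, INSTANCE_FOR_PROFILE["balanced"])

print(choose_instance("memory_heavy"))  # m2.xlarge
print(choose_instance("unknown"))       # m1.large
```

The point is that the choice happens per cluster at provisioning time, rather than being fixed once in the data centre.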

5. Running Closer to the Data

As businesses move their services to the cloud, it follows that their data starts living in the cloud. And since analytics thrives on data, usually large volumes of it, it makes little sense for analytics platforms to live outside the cloud, which would force inefficient, time-consuming migration of that data from its source to the analytics clusters.

Running Hadoop clusters in the same cloud environment is an obvious solution to this problem. This is, in a way, Hadoop's principle of data locality applied at the macro level.

6. Simplifying Hadoop Operations

As cluster consolidation takes place in the enterprise, one thing that gets lost is the isolation of resources for different sets of users. As all user jobs get bunched together on a shared cluster, the cluster's administrators start to face multi-tenancy problems: user jobs interfering with one another, differing security requirements, and so on.

The common response to this problem has been to enforce very restrictive cluster-level policies or limits that prevent users from doing anything harmful to other users' jobs. The trouble with this approach is that users' legitimate use cases are not fixed either. For example, it is common for administrators to cap the amount of memory Hadoop tasks can run with. If a user genuinely needs more memory, he or she has no support from the system.

Using the cloud, one can provision different types of clusters with different characteristics and configurations, each suited to a certain set of jobs. While this frees administrators from having to manage complicated policies for a multi-tenant environment, it also lets users pick the right configuration for their jobs.
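As a rough sketch of that idea, per-workload cluster configurations could live in a simple registry instead of one shared cluster policy. The team names, node counts, and memory settings below are all hypothetical:

```python
# Hypothetical per-team cluster profiles replacing a single shared,
# heavily restricted cluster. All values are illustrative only.
CLUSTER_PROFILES = {
    "etl":       {"nodes": 40, "task_memory_mb": 2048},
    "ml":        {"nodes": 10, "task_memory_mb": 8192},
    "reporting": {"nodes": 5,  "task_memory_mb": 1024},
}

def cluster_for(team):
    """Return the cluster profile a team's jobs should be launched on."""
    if team not in CLUSTER_PROFILES:
        raise KeyError(f"no cluster profile registered for team {team!r}")
    return CLUSTER_PROFILES[team]

# The ML team gets high-memory tasks without loosening the memory cap
# for everyone else, since its jobs run on a separate cluster.
print(cluster_for("ml")["task_memory_mb"])  # 8192
```

Each team's jobs land on a cluster sized for them, so a restrictive limit on one cluster never blocks a legitimate need on another.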

