What is Hadoop?
Hadoop is an Apache open-source project that provides scalable, distributed computing; its early development was driven largely by Yahoo!. It provides a framework that can process large amounts of data by leveraging the parallel, distributed processing of many compute nodes arrayed in a cluster. These clusters can run on a single host or scale up to thousands of machines, depending on the workload.
What are Hadoop Components?
These are the core modules of Hadoop, which together provide its distributed computing capabilities.
- Hadoop Common – The utilities that support the other Hadoop modules.
- Hadoop Distributed File System – The distributed file system used by most Hadoop distributions, also known by its initials, HDFS.
- Hadoop YARN – Used to manage cluster resources and schedule jobs.
- Hadoop MapReduce – A YARN-based system for parallel processing of large data sets.
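To make the MapReduce module above concrete, here is a minimal sketch of its programming model, the classic word-count job, written in plain Python rather than the Hadoop Java API. The function names, the simulated shuffle step, and the sample input are illustrative assumptions, not Hadoop code; in a real cluster the framework performs the grouping and distributes the map and reduce tasks across nodes.

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit an intermediate (word, 1) pair for every word in every line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase (done by the framework in real Hadoop):
# group intermediate pairs by key so each reducer sees one word's counts.
def shuffle(pairs):
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [count for _, count in group]

# Reduce phase: sum the counts emitted for each word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped}

# Sample input standing in for lines of files stored in HDFS.
lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The key idea is that the map and reduce functions are stateless and operate on key/value pairs, which is what lets Hadoop run them in parallel across many nodes and re-run them on failure.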
In addition to the core modules, there are others that provide specific and specialized capabilities to this distributed processing framework. These are just some of the tools:
- Ambari – A web-based tool for provisioning, management, and monitoring of Hadoop clusters.
- HBase – Distributed database that supports structured data storage.
- Hive – Data warehouse model with data summarization and ad hoc query capability.
- Pig – Data flow language.
- ZooKeeper – Coordination service for distributed applications.
These modules are available from the Apache open-source project, but there are also more than thirty companies that provide Hadoop distributions, which package the open-source code together with competing management solutions, processing engines, and many other features. Some of the best-known and most widely used distributions come from Cloudera, MapR, and Hortonworks.
Why Virtualize Hadoop Workloads?
Now that we know what Hadoop is, and since virtualization is a regular topic on this blog, the obvious question is: is Hadoop suitable for virtualization? Yes. If you would like the following additional benefits for Hadoop, you should consider virtualizing the workload.
- Better resource utilization:
Collocating virtual machines containing Hadoop roles with virtual machines containing different workloads on the same set of VMware ESXi™ server hosts can balance the use of the system. This leads to lower operating expenses and lower capital expenses as you can leverage the existing infrastructure and skills in the data center and you do not have to invest in bare-metal servers for your Hadoop deployment.
- Alternative storage options:
Originally, Hadoop was developed with local storage in mind, and this type of storage scheme can be used with vSphere as well. The shared storage that is frequently used as a basis for vSphere can also be leveraged for Hadoop workloads. This reinforces leveraging the existing investment in storage technologies for greater efficiencies in the enterprise.
- Multi-tenancy and isolation:
This includes running different versions of Hadoop itself on the same cluster, or running Hadoop alongside other applications, forming an elastic environment for different Hadoop tenants. Isolation can reduce your overall security risk, ensure you are meeting your SLAs, and support offering Hadoop as a service back to the lines of business.
- Availability and fault tolerance:
The NameNode, the Resource Manager and other Hadoop components, such as Hive Metastore and HCatalog, can be single points of failure in a system. vSphere services such as VMware vSphere High Availability (vSphere HA) and VMware vSphere Fault Tolerance (vSphere FT) can protect these components from server failure and improve availability.
- Balance the loads:
Resource management tools such as VMware vSphere vMotion® and VMware vSphere Distributed Resource Scheduler™ (vSphere DRS) can provide availability during planned maintenance and can be used to balance the load across the vSphere cluster.
- Business critical applications:
Uptime is just as important in a Hadoop environment; why would the enterprise want to go back to a time when servers and server components were single points of failure? Leverage the existing investment in vSphere to meet SLAs and provide an excellent service back to the business.
VMware also offers a component called VMware Big Data Extensions (https://www.vmware.com/products/big-data-extensions) to rapidly deploy highly available Hadoop components and easily manage the infrastructure workloads.
vSphere Big Data Extensions enables rapid deployment, management, and scaling of Hadoop in virtual and cloud environments. Scale-in and scale-out capabilities built into the Big Data Extensions tools also enable on-demand Hadoop instances.
Approaches ranging from simple cloning to sophisticated end-user provisioning products such as VMware vRealize Automation™ can speed up the deployment of Hadoop. This enables IT to act as a service provider and offer Hadoop as a service to the different lines of business, delivering faster time to market. It further positions today's IT as a value driver rather than being seen as a cost center.
For more detail about VMware Big Data Extensions, please see this datasheet from VMware: https://www.vmware.com/files/pdf/products/vsphere/VMware-vSphere-Big-Data-Extensions-Datasheet.pdf