Big Data Is More Than Hadoop
Many people today associate big data with Hadoop analytics – and, to be sure, Hadoop is an important technology for this. However, just as the Linux operating system is so much more than just the Linux kernel, a big data environment is so much more than just a Hadoop cluster.
In fact, from a Red Hat perspective, you need three primary things for a robust big data deployment:
- Big data infrastructure
- Big data analytics tools
- Big data application platform
Red Hat is working to deliver enterprise big data solutions that integrate these three areas across an open hybrid cloud. In this post, I’ll focus on how Red Hat plans to deliver a scalable, cloud-based big data infrastructure.
Big Data Infrastructure
Big data infrastructure needs to provide scalable compute and storage, and an open hybrid Infrastructure-as-a-Service (IaaS) cloud provides an ideal architecture for this. The rest of this section covers the elements Red Hat is building out to run big data workloads in the cloud.
Spinning up compute capacity in the cloud is important to big data – and I’ll explain more about this below – but first and foremost, big data requires scalable data storage that grows alongside compute.
Red Hat Storage provides scale-out storage that can extend into an open hybrid cloud. It is built on the GlusterFS distributed filesystem, and here is how it does so:
- Red Hat Storage is a pure software solution that runs on top of standard Red Hat Enterprise Linux with the XFS filesystem. This means Red Hat Storage can run anywhere Red Hat Enterprise Linux runs—including across physical systems, virtualized infrastructure and private or public clouds
- Red Hat Storage provides a global namespace, even across multiple data centers and across hybrid clouds. This allows a hybrid cloud in which virtual machines in a public cloud can operate on the exact same data as virtual machines in a private cloud
- Red Hat is working on a Red Hat Storage Hadoop plugin that it will contribute to the Apache Hadoop community. As a result, big data workloads with Hadoop analytics will be able to leverage Red Hat Storage as the underlying data store and span across hybrid clouds
Red Hat is also a leader in the OpenStack IaaS project and is working to deliver an enterprise OpenStack distribution to market (currently available as a preview to anyone with a Red Hat Enterprise Linux subscription). OpenStack aims to provide the ability to build a large private cloud that can host big data compute workloads. As an organization's big data compute needs grow, OpenStack will be able to elastically expand cloud-based computing capacity by provisioning new virtual machines.
In order for OpenStack compute capacity to adjust dynamically according to big data needs and policy, though, it needs cloud operations management tools. Red Hat’s recent acquisition, ManageIQ, provides these capabilities. ManageIQ includes rich monitoring and analytics tools to determine what is happening to cloud infrastructure. For example, it can determine when a particular cloud provider is saturated in certain resources. ManageIQ also includes the ability to create policies and provides orchestration tools to automate responses to events and policies. As Red Hat introduces its OpenStack product to market, it is also working to add support for OpenStack to ManageIQ. Combined, these capabilities will enable an enterprise to leverage ManageIQ’s features to auto-flex OpenStack-based capacity for big data computations.
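To make the idea of policy-driven auto-flexing concrete, here is a minimal sketch in Python. It is not ManageIQ's or OpenStack's API; the `ProviderMetrics` and `ScaleOutPolicy` names, the 80% CPU threshold, and the step size are all hypothetical and chosen only for illustration. The sketch shows the basic loop described above: monitoring produces a utilization snapshot, a policy evaluates it, and the result tells an orchestration layer how many virtual machines to provision.

```python
from dataclasses import dataclass


@dataclass
class ProviderMetrics:
    """Hypothetical utilization snapshot for one cloud provider."""
    name: str
    cpu_utilization: float  # fraction of CPU capacity in use, 0.0 to 1.0
    running_vms: int
    max_vms: int


@dataclass
class ScaleOutPolicy:
    """Hypothetical policy: add VMs once CPU utilization reaches a threshold."""
    cpu_threshold: float = 0.80  # assumed value, for illustration only
    vms_per_step: int = 2        # assumed value, for illustration only

    def vms_to_add(self, metrics: ProviderMetrics) -> int:
        """Return how many VMs to provision, capped by the provider's headroom."""
        if metrics.cpu_utilization < self.cpu_threshold:
            return 0
        headroom = metrics.max_vms - metrics.running_vms
        return max(0, min(self.vms_per_step, headroom))


policy = ScaleOutPolicy()
busy = ProviderMetrics("private-openstack", cpu_utilization=0.92, running_vms=9, max_vms=10)
idle = ProviderMetrics("private-openstack", cpu_utilization=0.35, running_vms=4, max_vms=10)

print(policy.vms_to_add(busy))  # 1: saturated, but only one VM of headroom remains
print(policy.vms_to_add(idle))  # 0: below the threshold, nothing to do
```

A real deployment would feed live monitoring data into such a policy and hand the result to the orchestration tooling; the sketch covers only the decision step.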
As large as today’s data centers are, a single one is often not enough for big data workloads. Data can also reside in more than one place, which requires the associated computing to do so as well. As a result, many enterprises span multiple data centers as well as private and public clouds. Red Hat’s CloudForms product aggregates multiple, disparate providers into uniform hybrid clouds. By leveraging CloudForms on top of OpenStack as well as public clouds, enterprises can deploy a big data compute platform that scales, not just within one OpenStack deployment, but across an entire hybrid cloud spanning multiple data centers. Furthermore, because CloudForms aggregates capacity across a variety of different cloud technology providers such as Red Hat, VMware, and Amazon AWS, enterprises can use both existing and new compute capacity without being locked into a single technology provider or platform. Red Hat is in the process of integrating ManageIQ and CloudForms into a next-generation version of CloudForms. This single cloud management platform is designed to aggregate and operate across open hybrid clouds in one interface.
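The cross-provider aggregation described above can be pictured with a small sketch. This is not how CloudForms is implemented or exposed; the `place_vms` function, the provider names, and the "fill providers in preference order" rule are assumptions made purely to illustrate how capacity from several providers can be treated as one pool.

```python
from typing import List, Tuple


def place_vms(providers: List[Tuple[str, int]], requested: int) -> List[Tuple[str, int]]:
    """Place `requested` VMs across providers, filling them in preference order.

    `providers` is a list of (name, free_vm_slots) pairs, most preferred first.
    Returns a list of (name, vms_placed) pairs.
    """
    placements = []
    remaining = requested
    for name, free in providers:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placements.append((name, take))
            remaining -= take
    if remaining > 0:
        raise ValueError(f"not enough capacity: {remaining} VMs could not be placed")
    return placements


# Hypothetical pool: private capacity first, public cloud as overflow.
pool = [("openstack-private", 6), ("vmware-dc2", 3), ("public-cloud", 20)]

print(place_vms(pool, 12))
# [('openstack-private', 6), ('vmware-dc2', 3), ('public-cloud', 3)]
```

The preference order stands in for whatever policy an enterprise would actually apply, such as cost, data locality, or compliance rules.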
Open Hybrid Cloud Infrastructure for Big Data
Now let’s bring it all together. Here’s how Red Hat plans to bring its scalable compute and storage capabilities together in one open hybrid cloud:
- Because Red Hat Storage can run in a virtual machine, we can make it available as a resource both in OpenStack and in a public cloud
- As CloudForms and ManageIQ orchestrate the scaling out of compute capacity in an open hybrid cloud, they can simultaneously do so for storage capacity as well by spinning up additional virtual machines running Red Hat Storage
- All this compute and storage can work seamlessly together across data center and firewall boundaries, because Red Hat Storage provides a global namespace
Big Data: An Ideal Workload for Open Hybrid Cloud
Big data, by its very nature, requires big, scalable infrastructure to run. An open hybrid cloud that spans multiple resource providers in private and public clouds, while simultaneously scaling out both compute and storage capacity, provides an extremely powerful platform for big data workloads. Red Hat is focused on delivering this type of infrastructure for enterprises to run big data—and all their other workloads—across an open hybrid cloud.
In follow-up posts, I’ll discuss why an open hybrid cloud makes sense for big data analytics and big data application platforms and how Red Hat is working to deliver those as well.