Friday, March 24, 2017

Orchestration and Meta-Scheduling for Big Data Architecture

MAPREDUCE, HADOOP, ETL, ELT, INFRASTRUCTURE, CLOUD, WORKFLOW, META-SCHEDULER AND COST OPTIMISATION

In today’s world, adopting Big Data is critical to most companies’ survival, and storing, processing and extracting value from data has become a main focus of IT departments. Big Data is commonly characterised by four properties: Volume, Variety, Value and Velocity. Systems such as Hadoop, Spark and Storm are the de facto building blocks of Big Data architectures (e.g. data lakes), but they fulfill only part of the requirements. Moreover, on top of this mix of features, which already represents a challenge for businesses, new opportunities add even more complexity. Companies are now looking at integrating more sources of data, breaking silos (variety is increasing with structured and unstructured data), and acting on data in real time. All of these capabilities are becoming key for decision makers.

Multiple solutions on the market support Big Data strategies, but none of them fits every company’s use cases. Consequently, each of these solutions is responsible for extracting part of the meaning from the data. Although this mix of solutions adds complexity to infrastructure management, it also allows the full value of the data to be extracted. New questions are then raised: How do I break company silos? How do I make sense of this pool of unrelated and unstructured data? How do I leverage supervised and unsupervised machine learning? How do I synchronize and orchestrate the different Big Data solutions? How do I allocate relevant resources? How do I ensure critical reports get prioritized? How do I enforce data locality rules and control how information spreads? How do I monitor the whole data journey?

This article highlights two points on the technical, operational and economic challenges around orchestration solutions. To leverage Big Data, companies will need to address them in order to optimize their infrastructure, extract faster and deeper insight from their data, and thus gain a competitive edge. For a more detailed and complete document, do not hesitate to download the full white paper.

Ensure High Utilization

On the path to leveraging Big Data, optimizing resource utilization is critical. This can be achieved by programming job dependencies, allocating appropriate resources to each task, and managing the overall resource pool globally.

As Adrian Cockcroft, while cloud computing leader at Netflix, said: “If you build applications that assume the machines are ephemeral and can be replaced in a few minutes or even seconds, then you end up building an application that is cost-aware.” Cost savings can then be achieved with an orchestration tool that includes a resource manager. Such a solution is aware of future resource needs and organizes the pool of resources to match upcoming demand. IT teams can configure smart behaviors within their system, such as cloud bursting to acquire additional resources when required, resource failure management to reschedule tasks and attempt to recover failing resources, and resource allocation optimization to parallelize tasks over different resources (e.g. CPU and GPU) (Fig. 1).
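As a rough illustration of that cost-aware behavior, the sketch below shows how a meta-scheduler might decide to burst to the cloud when pending demand exceeds the free resource pool. All the names (Cluster, Job, acquire_cloud_nodes) are hypothetical and not tied to any specific product.

```python
# Minimal sketch of a cloud-bursting policy, assuming a hypothetical
# meta-scheduler model; Cluster, Job and acquire_cloud_nodes are
# illustrative names, not a real product API.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores_needed: int


@dataclass
class Cluster:
    free_cores: int

    def acquire_cloud_nodes(self, cores: int, cores_per_node: int = 8) -> int:
        """Burst to the cloud: add enough ephemeral nodes to cover the shortfall."""
        nodes = -(-cores // cores_per_node)  # ceiling division
        self.free_cores += nodes * cores_per_node
        return nodes


def schedule(cluster: Cluster, pending: list[Job]) -> None:
    demand = sum(j.cores_needed for j in pending)
    shortfall = demand - cluster.free_cores
    if shortfall > 0:
        added = cluster.acquire_cloud_nodes(shortfall)
        print(f"Bursting: acquired {added} cloud node(s) to cover {shortfall} extra cores")
    for job in pending:
        cluster.free_cores -= job.cores_needed
        print(f"Dispatching {job.name} on {job.cores_needed} cores")


schedule(Cluster(free_cores=16), [Job("risk-calc", 12), Job("etl-merge", 10)])
```

Releasing the ephemeral nodes once the queue drains follows the same logic in reverse, which is what keeps the approach cost-aware.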

Most solutions such as Hadoop, Spark and Storm embed their own resource manager but do not have a global vision of the infrastructure, so they benefit from a global scheduler, also called a meta-scheduler. Indeed, a meta-scheduler with a resource manager balances resources across all running processes to ensure higher utilization of the overall resource pool and meet individual job deadlines.

A CPU-intensive application, such as a risk calculation on portfolio exposure to events, can run in parallel with a RAM-intensive application, such as merging different exposure risks. This allows more efficient use of a given machine, with two or more tasks running in parallel.

Once a Hadoop job completes, the meta-scheduler can then plan another job on the freed resources depending on resource requirements, dependencies, priorities, results returned, etc.


Fig 1: Optimized scheduling according to resource utilization
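To make the co-scheduling idea above concrete, here is a minimal sketch, using made-up resource figures, of how a scheduler could check that a CPU-bound and a RAM-bound task fit together on one machine and only plan the dependent job once both have completed.

```python
# Minimal sketch of resource-aware co-scheduling with a dependency.
# Machine capacity, task footprints and task names are invented for
# illustration; they are not taken from any real workload.
from concurrent.futures import ThreadPoolExecutor

MACHINE = {"cpu_cores": 16, "ram_gb": 64}

TASKS = {
    "risk_calc": {"cpu_cores": 14, "ram_gb": 8},        # CPU-intensive
    "merge_exposures": {"cpu_cores": 2, "ram_gb": 48},  # RAM-intensive
}


def fits_together(tasks, machine):
    """Check that the combined footprint stays within the machine's capacity."""
    return all(
        sum(t[dim] for t in tasks.values()) <= machine[dim] for dim in machine
    )


def run(name):
    print(f"running {name}")
    return name


if fits_together(TASKS, MACHINE):
    # The two complementary tasks run in parallel on the same machine...
    with ThreadPoolExecutor(max_workers=2) as pool:
        done = [f.result() for f in [pool.submit(run, n) for n in TASKS]]
    # ...and the dependent job is only planned once its inputs are available.
    print(f"dependencies {done} satisfied, scheduling 'exposure_report'")
```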

Orchestration Benefits

Getting insight from a large amount of collected data is complex, and multiple parameters have to be taken into account. Various solutions, from Hadoop to Storm, offer ways to partially extract information along the data journey and answer the various challenges (volume, variety, value and velocity). New tools are also joining this ecosystem to add metadata, tag and prepare data, offer SQL access, etc.

To organize all of these solutions and optimize parallelization, an orchestrator or meta-scheduler is required (Fig. 2). It pilots the diverse applications to ensure the process flow is respected, ensures the data follows government and company policies (e.g. data locality), handles error management and integrates with other solutions such as BI tools, reports, etc. Moreover, to overcome challenges faced by each individual solution, an orchestrator enables secure data transfers, selects resources through firewalls, balances the overall load efficiently, etc.


Fig 2: Big Data simplified ecosystem and data journey
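As an illustration of the data-locality policies an orchestrator can enforce during resource selection, the short sketch below filters candidate nodes for a job whose data must stay in a given region. The node metadata and the policy itself are invented for the example.

```python
# Minimal sketch of a data-locality rule applied during resource selection.
# Node metadata, regions and the job definition are hypothetical; an
# orchestrator would apply a similar policy before dispatching each step.
NODES = [
    {"name": "node-eu-1", "region": "eu", "free_cores": 8},
    {"name": "node-us-1", "region": "us", "free_cores": 32},
]

JOB = {"name": "customer_scoring", "data_region": "eu", "cores": 4}


def eligible(node, job):
    """Enforce the locality policy: data tagged 'eu' never leaves EU nodes."""
    return node["region"] == job["data_region"] and node["free_cores"] >= job["cores"]


candidates = [n for n in NODES if eligible(n, JOB)]
print(f"{JOB['name']} can run on: {[n['name'] for n in candidates]}")
```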

And More…

An orchestration tool and meta-scheduler also bring additional value, such as:

  • Optimizing Operational Costs with an Error Management System
  • Optimizing Operational Costs with Single Pane Dashboard
  • Getting Highly Critical Jobs Prioritized
  • Securing Your Data and Enforcing Compliance
  • Ensuring Data Integrity
  • Etc.
For more detail, download our complete white paper.

The Role of ProActive in the Big Data Area

In this Big Data ecosystem, the ProActive Open Source solution fits into two main areas.

ProActive has a proven record in processing optimization (accelerating workload completion) through distribution and parallelization, which makes it suitable for long and complex analyses. By closely managing the diversity of resources available to a company or business unit and by understanding algorithm dependencies and requirements, businesses get insights from their data faster and at lower cost, while keeping control of the execution. Multiple languages are supported, including R and Python, the languages most commonly used to extract deeper information from data.
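The split/compute/merge pattern behind this kind of distributed analysis can be sketched in plain Python, as below. The portfolio data and pricing function are placeholders, and a workflow engine would run each chunk as a separate task on its own resource rather than as local processes.

```python
# Minimal sketch of the split/compute/merge pattern behind distributed analysis.
# The portfolio data and price() function are made-up placeholders.
from multiprocessing import Pool


def price(position):
    # Placeholder for an expensive model evaluation (e.g. a risk calculation)
    return position["quantity"] * position["unit_value"]


portfolio = [{"quantity": q, "unit_value": 1.5 * q} for q in range(1, 1001)]

if __name__ == "__main__":
    with Pool(processes=4) as pool:             # split across 4 workers
        exposures = pool.map(price, portfolio)  # compute each chunk in parallel
    print(f"total exposure: {sum(exposures):,.2f}")  # merge the partial results
```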

ProActive has also been used as a meta-scheduler and orchestrator for advanced architectures which have to balance security rules, fast processing, information accessibility, governance, and third-party software interactions. It provides the ability to optimize data transfers through advanced workflows and resource selection, including across layer-3 networks and firewalls (Fig. 3). Its global view of the architecture enables load balancing and secure synchronization of multiple processes.


Fig 3: Meta-scheduling of Big Data applications over secured environments

Finally, the Open Source approach followed by ActiveEon means greater flexibility: it eases integration with existing IT architectures and supports the integration of new technologies as they emerge.
