Tuesday, December 20, 2016

Resource Reservation



Every IT infrastructure includes heterogeneous resources that serve different purposes. Some resources might be better suited for RAM-intensive tasks, others might host a secured database, others offer low latency to the customer, and others might provide more bandwidth.

This article takes as an example a task that needs to access a database or some other specific resource. A selection script ensures the task is executed on the server hosting this resource, but other requirements have to be considered as well.
What if other tasks are waiting and yours has higher priority? What if your task requires the whole capacity of the machine?
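In ProActive, selection scripts are usually written in Groovy or JavaScript, but the underlying logic is simple, as this Python sketch shows. The node property keys and the node names below are hypothetical, chosen only to illustrate the idea of matching a task to the node hosting a required resource:

```python
def select_node(node_properties, required_resource="local-database"):
    """Return True if this node should be selected for the task.

    node_properties is a dict describing the node; the keys used here
    ("resources", "free_memory_mb") are illustrative, not a ProActive API.
    """
    has_resource = required_resource in node_properties.get("resources", [])
    enough_memory = node_properties.get("free_memory_mb", 0) >= 2048
    return has_resource and enough_memory

# Example: only the node hosting the database with enough free RAM matches.
nodes = {
    "db-host-01": {"resources": ["local-database"], "free_memory_mb": 8192},
    "worker-02": {"resources": [], "free_memory_mb": 16384},
}
eligible = [name for name, props in nodes.items() if select_node(props)]
```

A real selection script would read actual node properties instead of a hard-coded dict, but the predicate shape stays the same.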


Thursday, December 8, 2016

Big Data, Languages and Software solutions

Inspired by an article from ThinkR

Big Data is a common term that everyone uses without a clear definition of what it is. It appears everywhere and in every industry, from transportation to human resources to healthcare, which makes it all the harder to pin down. For this article, let’s agree that it is Big Data when the information cannot be handled by a single computer, even though computers with 1 TB of RAM are available these days. In increasingly common cases, all the data for a computation might not even be in one place, which requires advanced distributed algorithms and software.

Big Data for Hadley Wickham

Hadley Wickham, a well-known developer for the R language who has written multiple libraries used in most R projects, defined 3 main categories of Big Data:

Monday, December 5, 2016

ProActive Calendar Step-by-Step




Calendars offer a nice, user-friendly interface which allows users to plan ahead, manipulate events, duplicate them and much more.

By extending traditional calendar applications, more than just events can be handled.




Introducing our calendar service. With it, ProActive Workflows & Scheduling offers more flexibility and a way to automatically synchronize the scheduler with your favorite calendar manager (Google Calendar, Thunderbird, Outlook, Apple Calendar, etc.). This offers a new UI adapted to repetitive tasks, enabling more intuitive control through better visualization and simplified job handling (drag & drop). To use our calendar service, navigate to the scheduler, click on “calendar”, retrieve the URL (generate it if needed) and use it to create a new calendar in your calendar manager. You can test this through our free online test platform.


Tuesday, November 29, 2016

Risk Model Optimization and Traceability for Solvency II

Solvency II and Basel III, the new standards for insurance and bank regulations, codify and unify capital, data management, and disclosure requirements for insurers and banks in order to protect companies and consumers from risk.

More specifically, with Solvency II the effect of every insurance contract, capital market relation, optionality and risk source has to be modeled and measured. The goal is to ensure insurers have enough equity to cope with any forecasted risk. Considering the amount of heterogeneous assets and liabilities owned by an insurance company, algorithms such as Monte Carlo simulation are well suited to estimating the outcome of different scenarios with a given accuracy and deriving metrics such as VaR (value at risk).

A typical process based on a Monte Carlo algorithm usually involves multiple steps:

  1. Define a model for each asset, liability, economic scenario, etc. using tools such as QuantLib, Bloomberg solutions, Wall Street Systems, Apollo, etc.
  2. Generate scenarios based on these models (this can easily exceed 2 million scenarios)
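As a toy illustration of how VaR is read off Monte Carlo output, here is a minimal Python sketch. The normal loss model is invented for the example and is not an actual Solvency II internal model:

```python
import random

def simulate_losses(n_scenarios, mu=0.0, sigma=1.0, seed=42):
    """Draw one portfolio loss per scenario from a normal model (illustrative)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_scenarios)]

def value_at_risk(losses, confidence=0.995):
    """VaR at the given confidence level: the loss exceeded in only
    (1 - confidence) of the simulated scenarios."""
    ordered = sorted(losses)
    index = int(confidence * len(ordered)) - 1
    return ordered[index]

losses = simulate_losses(100_000)
var_995 = value_at_risk(losses)  # Solvency II works with a 99.5% level
```

With enough scenarios the estimated quantile converges, which is why real runs easily reach millions of scenarios and benefit from being distributed.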

New Feature! Preview Intermediate and Final Results.

The “Task Preview” feature allowed you to access the results of individual tasks as soon as each task ended. It worked fine for common types but could be cumbersome for more complex cases. This is why we decided to improve it.

Goodbye Task Preview, long live Task Result!

Now (as of ProActive Workflows & Scheduling 7.20), each task comes with a resultMetadata map, related to the “result”, which can contain the following information:

  1. file.extension,
  2. file.name,
  3. content.type, specifying how the browser should open the result.

After the task’s execution, the result can then be opened in a browser or downloaded from the Preview tab in Scheduler.
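For instance, a task producing a PNG image could fill the map as follows. This Python dict only mirrors the structure of the resultMetadata map described above; the values and the helper are an assumed example, not the ProActive API:

```python
# Hypothetical metadata a task could attach to a PNG result.
result_metadata = {
    "file.extension": "png",
    "file.name": "plot.png",
    "content.type": "image/png",  # tells the browser to render it inline
}

def download_name(metadata, default="result.bin"):
    """Pick the filename a client would use when downloading the result,
    falling back to a default when file.name is absent."""
    return metadata.get("file.name", default)
```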

See also the documentation



You can see this feature in action by using this workflow on our online platform: try.activeeon.com.

Friday, November 11, 2016

Optimized Algorithm Distribution

With the growth rate of devices and sensors, analyzing collected data becomes more complex. Data scientists and analysts have to deal with heterogeneous sources and leverage distributed environments to improve analysis time. To achieve this, the ability to split a complex algorithm into smaller units is fundamental. This article details ActiveEon’s approach.

To distribute work efficiently with ProActive, ActiveEon defines a task as the smallest unit of work that can be executed. A composition of tasks is called a workflow and includes additional features such as dependencies, variable propagation and shared resources. By following this principle, distribution is taken into consideration by design, since dependencies are explicit. The workflow designer is therefore in control of the distribution and can heavily optimize resource utilization and reduce execution time.

Terminology

A task is the smallest unit of work.
A workflow is a composition of tasks which includes additional features such as dependencies, variable propagation and shared resources.
A job is a workflow that has been submitted to the scheduler. (Multiple jobs can be created from the same workflow.)

It is important to distinguish Workflows from Jobs. A Workflow is a directed graph of Tasks, where Tasks represent the smallest units of work that can be distributed and executed on nodes. A Workflow can be seen as a model or a template. A Job is an instance of a Workflow which has been submitted to the Scheduler and is managed by it. Jobs differ from Workflows in several ways: Jobs may have variable values updated at runtime, control structures such as loops which are expanded at runtime, etc.
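The idea can be sketched in a few lines of Python: a workflow is just a task graph, and its tasks become runnable in dependency order. The graph and function below are illustrative, not the ProActive API:

```python
from collections import deque

# A workflow as a directed graph: task -> list of tasks it depends on.
workflow = {
    "split": [],
    "process_a": ["split"],
    "process_b": ["split"],
    "merge": ["process_a", "process_b"],
}

def execution_order(graph):
    """Topological sort: a task becomes runnable once all its dependencies
    have finished. Independent tasks (process_a, process_b) could run in
    parallel on different nodes."""
    pending = {t: len(deps) for t, deps in graph.items()}
    ready = deque(t for t, n in pending.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for t, deps in graph.items():
            if task in deps:
                pending[t] -= 1
                if pending[t] == 0:
                    ready.append(t)
    return order

order = execution_order(workflow)
```

Because dependencies are explicit, the scheduler (here, the sort) knows exactly which tasks can run concurrently, which is what makes the distribution optimizable by design.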

Monday, November 7, 2016

Automated Workflow Translation to modern and Open Source Schedulers

Digital Transformation and Migration to the Cloud with Open Solutions

In the past couple of years, new trends have emerged in IT; more precisely, companies are investigating new ways to leverage their Big Data, hybrid cloud infrastructures and IoT devices. This trend is leading to new business requirements which are redefining IT architecture with a greater focus on flexibility and automation.

Current legacy software struggles to answer those needs, which pushes companies to investigate open solutions. The value is evolving from pure infrastructure optimization to include platform connectors and open REST APIs. Consequently, open source solutions have an edge since they offer a comprehensive communication layer which allows complete integration with existing tools. In conclusion, a migration towards open solutions will ease IT digital transformation, migration to the Cloud, etc. and support future business needs.

To support the migration from one scheduler to another, ActiveEon offers a service to automatically translate workflows between Control-M (BMC), Autosys (CA), Dollar Universe (Automic), One Automation (Automic), ProActive (ActiveEon), etc.

This service will particularly benefit large organizations with thousands of workflows, such as banks, insurance companies, financial institutions, telecoms, government agencies, etc. Indeed, at this scale, automation is advised for a quick migration process.

Migration Tool

Monday, October 31, 2016

Leverage Hybrid Infrastructure with SPOT instances or Preemptible VMs

Cloud computing allows companies to accelerate their business processes, optimize infrastructure costs and scale more quickly. However, integrating these new services with an existing infrastructure can be complex and may not fully leverage this new opportunity. This article focuses on unstable instances such as Spot instances (AWS) or Preemptible VMs (GCP), which offer cheaper computational power.

What is a SPOT instance and a Preemptible VM?

AWS offers a service called Spot Instances. It allows customers to bid on unused EC2 capacity in any availability zone. GCP offers a similar service called Preemptible VMs. Customers can then use compute capacity with no upfront commitment, at an hourly rate lower than the on-demand rate. The main drawback is that GCP and AWS can withdraw these instances at any time, with little warning, depending on the market price of the resource and the bidding price.

Workloads Requirements

As explained in those two descriptions, instances can be withdrawn from customers at any time. Workloads leveraging this computing capacity require an advanced error management tool to cope with an uncertain lifecycle; otherwise any previous work will be lost, defeating the overall purpose. These workloads also need to be split into smaller tasks so that each computation can be performed in a short period of time. This ensures that completed work is saved for dependent tasks.
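The idea of splitting work so that little is lost on withdrawal can be sketched as follows. The chunking and the checkpoint file below are an illustration of the pattern, not a ProActive feature:

```python
import json
import os
import tempfile

def process_chunk(chunk):
    """Stand-in for one short unit of real work."""
    return sum(chunk)

def run_with_checkpoints(chunks, checkpoint_path):
    """Process chunks one by one, persisting results after each: if the
    instance is withdrawn mid-run, only the current chunk is lost and a
    later run resumes from the checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = {int(k): v for k, v in json.load(f).items()}
    for i, chunk in enumerate(chunks):
        if i in done:
            continue  # already computed before the interruption
        done[i] = process_chunk(chunk)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return [done[i] for i in range(len(chunks))]

chunks = [[1, 2], [3, 4], [5, 6]]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
results = run_with_checkpoints(chunks, path)
```

Running the function again with the same checkpoint path skips the finished chunks, which is exactly the behavior a spot-friendly workload needs.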

When to leverage them?

One of the common use cases for these services is when time is less of a constraint than cost. For instance, a workload running at night could take as much time as required as long as it is completed by 8am.

Another common use case is when demand spikes occur in the current infrastructure, generating a large queue. This is often seen in R&D environments using a single HPC cluster or a limited infrastructure. In that case, time constraints have to be balanced against price. Spot instances offer the ability to drain the queue with instances cheaper than on-demand ones.

Some business applications require a stable environment to perform efficiently. However, such stable resources might be busy when needed. Leveraging unstable resources makes it possible to free up stable resources ahead of time.

Many other use cases can be found on the GCP and AWS websites or on the Netflix blog.

What does ProActive offer in these situations?

Wednesday, October 26, 2016

Reliable Execution in each Environment with Docker and ProActive

Using Docker containers makes the code independent from the underlying environment. Being heavily multi-platform, ProActive offers different ways to manage dependencies, including the use of containers, detailed in this article.

How to start a container in ProActive

In ProActive, launching a task within a container takes three simple steps:

  • go to the “Fork Environment” tab,
  • in the Java Home field, enter the path to the Java installation directory on the node,
  • finally, in the environment script field, fill the “preJavaHomeCmd” variable with the docker command that will start the appropriate container. For instance, preJavaHomeCmd = 'docker run -p 80:8080 --rm java'.
After that, you’re done.

Let’s start with an example

A picture paints a thousand words. Below is an example, based on a previous article, that will be containerized. You can follow these instructions on try.activeeon.com or after installing ProActive on your own server.

Thursday, October 13, 2016

Error Management Best Practices in Production?


In various situations, advanced error management tools save time and money: in production, where problems must be solved without human intervention to ensure high availability; in unstable environments such as Spot instances, where errors are commonplace and disruptive; and in resource-intensive workflows which must not be run twice due to their cost.
To manage these instabilities, ProActive offers many functionalities that help you deal with errors.


Functionalities and Support


The Advanced Error Management feature provides multiple options in case of failure. First, the number of attempts for any given task can be configured, with the ability to choose whether to run subsequent attempts on a different node. Then,
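A minimal sketch of the retry idea in generic Python (this is not ProActive's implementation): the task is attempted up to N times, with an optional hook between attempts, e.g. to pick another node:

```python
def run_with_retries(task, max_attempts=3, on_retry=None):
    """Run task() until it succeeds or max_attempts is reached.
    on_retry, if given, is called between attempts (e.g. to select
    a different node for the next try)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if on_retry is not None and attempt < max_attempts:
                on_retry(attempt)
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error

# A flaky task that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "ok"

result = run_with_retries(flaky, max_attempts=5)
```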

Thursday, September 29, 2016

How to accelerate your R algorithm with ProActive?

When analyzing data sets, getting “instant” insight is often critical for business decisions, which is why one may want to allocate all available resources to a set of tasks delivering analysis for stakeholders. Here is an example of a Twitter feed analysis using R (a statistics-oriented language) and ProActive. It gathers messages, then performs distributed analysis on distinct computing nodes.

How it works

In selected cities, this workflow searches Twitter feeds containing a chosen set of keywords and returns a chosen number of tweets. (Due to restrictions from Twitter, the number of towns cannot be greater than 15: “Rate limits are divided into 15 minute intervals [...] 15 calls every 15 minutes.”, https://dev.twitter.com/rest/public/rate-limiting)

ProActive first creates as many tasks as there are cities, then runs them in parallel on ProActive nodes. In these R-based tasks, each node will:

  • connect to Twitter,
  • perform the desired search,
  • analyze the information,
  • store the result in the ProActive dataspace (a shared storage environment adapted to distributed systems).
A final task aggregates all the data before storing it in the above-mentioned dataspace.
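The split/merge shape of this workflow can be sketched in Python. The per-city analysis is stubbed here; in the real workflow each such task is an R script calling the Twitter API on its own ProActive node:

```python
from concurrent.futures import ThreadPoolExecutor

CITIES = ["Paris", "Nice", "Lyon"]  # at most 15, per Twitter's rate limits

def analyze_city(city):
    """Stub for one per-city task: fetch tweets, analyze, return a summary."""
    tweets = [f"{city} tweet {i}" for i in range(3)]  # stand-in for the API call
    return {"city": city, "tweet_count": len(tweets)}

def merge(results):
    """Final task: aggregate the per-city results."""
    return {r["city"]: r["tweet_count"] for r in results}

# One parallel task per city, then a merge, mirroring the workflow's shape.
with ThreadPoolExecutor(max_workers=len(CITIES)) as pool:
    summary = merge(pool.map(analyze_city, CITIES))
```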


Let's code

Monday, August 29, 2016

Deploy a ProActive cluster using Docker



The goal of this blog post is to give an overview of how easy it is to deploy a ProActive cluster and of the benefits you can gain from it.


Docker containers are a great way to quickly deploy and re-deploy a pre-configured ProActive cluster.





As you can see in the figure above, a ProActive cluster is composed of three different components:
  • the ProActive Scheduler Server,
  • the database,
  • the ProActive Nodes.
All these components will be set up in Docker containers thanks to a container orchestrator.





First of all, we suppose that you have several hosts for Docker containers, and that you use an orchestrator for them, for instance Swarm, Mesos or Kubernetes. You then have a way to abstract the network in your cluster, thanks to Docker overlay networking if you use Swarm, any Mesos networking solution, or Kubernetes networking.
The protocol used for communication between the ProActive Nodes and the ProActive Scheduler Server is PNP by default, but several other protocols are available, such as PNPS and PAMR (ProActive Message Routing).


Once you are sure that your containers can ping each other across hosts, you can first run your database container. Obviously, you can persist data thanks to Docker volumes and configure users for the Scheduler Server at runtime.


The second step is to launch the Scheduler Server container and link it to the database container. If you access the Web UI on the Docker host, thanks to the port redirection, you will notice that there is no Node running in the Resource Manager portal. This is the normal behaviour; indeed, our goal is to have Nodes running in other containers.


Finally, the last step is the deployment of the Nodes. You just have to configure them to connect to the Resource Manager and choose how many workers you want per Node. You can launch as many Nodes as you want, on as many hosts as you want.


Obviously, you can also keep data and logs for the Nodes and the Scheduler Server in Docker volumes.


Once everything is running, you can write some Workflows, execute them, and see on which Nodes they are executed.


And here you are: you now have an entire cluster ready to execute jobs, and you can enjoy all the benefits of our generic Scheduler, which allows you to run Spark jobs, MapReduce jobs, ETL processes…

Friday, August 26, 2016

Submitting ProActive Workflows with Linux cURL


Using cURL from a Linux command line (bash) is a very convenient and resource-efficient way to submit workflows.
We first need to log in and store the current session id with the command:

sessionid=$(curl --data 'username=admin&password=admin' http://localhost:8080/rest/scheduler/login)


Here we log in with curl, sending the username and password as POST data with --data. The result is stored in the sessionid variable. The session id can be displayed with echo $sessionid.


Workflows can be submitted with cURL:

curl --header "sessionid:$sessionid" \
 --form "file=@filename.xml;type=application/xml" \
 http://trydev.activeeon.com:8080/rest/scheduler/submit




The session id variable is inserted into the header, and the @ notation tells curl to send the file directly to the server.


Advanced: Workflow Submission with Variables



Workflow variables can be sent in the submission URL as matrix parameters. These variables will be substituted in the workflow.

curl --header "sessionid:$sessionid" \
 --form "file=@file.xml;type=application/xml" \
 "http://trydev.activeeon.com:8080/rest/scheduler/submit;variable1=test1;variable2=test2"

Important: the URL is now embedded in double quotes ""; only then are the matrix parameters properly transferred. Variables are separated by semicolons (;).
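The same submission can be scripted in Python. The helper below builds the matrix-parameter URL used by the cURL example; the POST itself is shown for illustration only (it assumes the third-party requests package and a reachable scheduler):

```python
def build_submit_url(base_url, variables=None):
    """Append workflow variables as matrix parameters:
    .../submit;var1=a;var2=b (semicolon-separated, as with cURL)."""
    url = base_url
    for name, value in (variables or {}).items():
        url += f";{name}={value}"
    return url

url = build_submit_url(
    "http://trydev.activeeon.com:8080/rest/scheduler/submit",
    {"variable1": "test1", "variable2": "test2"},
)

def submit(url, session_id, workflow_path):
    """Illustrative POST mirroring the cURL call above."""
    import requests  # third-party dependency, assumed installed
    with open(workflow_path, "rb") as f:
        return requests.post(
            url,
            headers={"sessionid": session_id},
            files={"file": ("workflow.xml", f, "application/xml")},
        )
```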