Activeeon Team Blog: 2015

Wednesday, November 25, 2015

ProActive in Docker containers

ProActive Workflows & Scheduling 7.0.0 is available on the Docker hub, now!

How to run ProActive Workflows & Scheduling inside Docker

1) Have Docker and Docker Compose installed
2) Get the ProActive Docker Compose file here
3) Save the ProActive Docker Compose file as docker-compose.yml
4) Run "docker-compose up" in the same directory as the ProActive Docker Compose file
5) Wait for the "*** Get started at http:/[IP]:8080 ***" message
6) Browse to http://[IP]:8080
7) Login with default login: admin:admin

Get 10 days free support: registering through activeeon.com

Be part of our open source community on github

Tuesday, November 24, 2015

Add Docker to your legacy systems with ProActive

ProActive Workflows & Scheduling 7.0 is out! Which has an exciting new feature, Docker containers are supported inside Workflows.

tl;dr:

Docker container support > now < in ProActive Workflows & Scheduling 7.0. It combines the strength of supporting, keeping and combining legacy systems with the newly added Docker containers support. That enables everyone to combine legacy systems with new cutting edge technology/systems.

Get access through:

containers.activeeon.com

What is ProActive Workflows & Scheduling?

Quick introduction to our:

Workflow based Scheduler
Resource Manager
Web interfaces

Quick introduction

Studio

Our Workflow Scheduler allows to compose bigger tasks out of smaller components, by creating Workflows. Workflows consist out of many languages and can contain services and batch tasks. So they provide a lot of freedom in solving problems, collecting and analyzing data and much more. The best part about it is, it is automatically distributed, but more about that in the Scheduler section.

Scheduler

The Scheduler knows all about jobs and tasks, a ProActive Workflow is represented as a job and the Workflow's building blocks are called tasks. It communicates with the Resource Manager to distribute all the work. Resources are maximally utilized because the Scheduler knows what needs to go first for an efficient resource utilization, while not breaking the constraints of the Workflows. So your workload is automatically distributed and the advanced fault tolerance ensures error-recovery after hardware and software failures.

Resource Manager

The Resource Manager knows all resources and everything about them. Resources can be removed and added at runtime, ProActive’s fault tolerance will ensure proper execution of Workflows even if nodes are removed or fail. Best about it is: it is very easy to scale, just one command adds or removes new resources.

Many more features

ProActive Workflows & Scheduling has many more features than introduced in the speedy introduction. If you are interested in something particular, contact us or visit our website.

Legacy systems? - ProActive has you covered

ProActive Workflows & Scheduling is a perfect fit with your legacy system. We at Activeeon have many years of experience with bringing legacy systems together and adding new functionality to the mix. Legacy systems can be re-written inside the ProActive Studio or can be accessed inside Workflows, without touching them. Due to the flexibility of Workflows, new systems can be added just fine.

Working many years with customers brought a wide support of languages for our Studio. Including:

Python
Bash
R
Java
Ruby
And more....

Our ProActive Workflows allow you to combine all types of languages in one Workflow, to reach your goal faster, and without changing your working legacy code.

If you are familiar with High Performance Computing, you might know:

SLURM
IBM Platform LSF
PBS Pro

We connect all of them and combine access in our single web interface.

If you have worked with legacy systems, you know that you better keep them running as long as they do, because changes can be expensive.

But Docker has shaken the IT world and made many things easier and faster than before, and now, with ProActive Workflows & Scheduling you can use the newest technologies next to your legacy systems.

Docker support in Workflows & Scheduling

Finally the day arrived to announce Docker support inside ProActive Workflows & Scheduling.

It gives you many advantages:

Advanced ProActive Workflows & Scheduling fault tolerance for your Docker containers
Run Docker and legacy systems without updating existing code
Improve your resource utilization significantly

ProActive Workflows & Scheduling brings advanced fault tolerance features to Docker containers:

Resilient to hardware failures
Resilient to software errors

Re-execute your Docker containers automatically on different hardware if one or more fail.

If your software experiences errors, re-execute X times in Y different environments.

Run Docker containers alongside legacy systems:

Integrate Docker containers in your legacy system mix
Don't touch any legacy system
ProActive Workflows & Scheduling combines legacy and new systems

Improve your resource utilization significantly

Docker containers, in a distributed environment, need efficient resource management, otherwise machines will be over- or under-utilized.

ProActive Workflows & Scheduling has an effective resource management which reaches high efficiency, and scales your infrastructure according to the current demand.

Want to have a try?

Open source and free!

No installation required, as a web-app or download at:

containers.activeeon.com

Tuesday, July 21, 2015

Portfolio risk diversification with Spark & ProActive Workflows

Authors: Michael Benguigui and Iyad Alshabani

In order to design a financial tool to diversificate risks, we describe the following specific spark jobs orchestrated by ProActive Scheduler. Since we need to process huge amounts of asset prices and achieve streaming k-means to keep models up to date, Spark (http://spark.apache.org/) is definitely the right technology. Firstly, MLlib (Spark’s scalable machine learning library https://spark.apache.org/mllib/) offers functionalities for streaming data mining. Secondly, Spark allows to work with advanced map/reduce instructions on RDDs (Resilient Distributed Dataset), i.e. partitioned collections of elements that can be operated in parallel. All these jobs required a powerful framework to be deployed, orchestrated and monitored, as provided by ProActive.

Spark streaming jobs

Spark Cluster

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers to allocate resources across applications: Spark’s own standalone cluster manager, Mesos (http://mesos.apache.org/), YARN (http://fr.hortonworks.com/hadoop/yarn/). Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.

ProActive Workflow, Scheduler & Resource Manager

Spark jobs are created and deployed using Proactive Workflows, offering task templatization to facilitate workflow elaboration. Indeed, Proactive Workflow & Scheduling is an integrated solution for deploying, executing and managing complex workflows in the cloud and on premise as well. All jobs are orchestrated by the ProActive Scheduler and the ProActive Resource Manager virtualizes and monitors resources.

The ProActive Scheduler and Resource Manager are deployed in a way that monitors each Spark Cluster which, in its turn, monitors its applications. The ProActive scheduler is used as a meta scheduler of the set of Spark clusters. Indeed, each Spark cluster has its own Spark scheduler. ProActive then orchestrates the set of clusters and jobs as shown with the next figure.

Architecture of our application dedicated to diversificate risks

Spark Streaming jobs

We focus on the CAC40 stocks correlations analysis over short periods. Indeed, our spark streaming jobs require a batch duration parameter to be set. This parameter controls the time period in seconds, during which the job stream input data, before processing. The main Spark jobs of our financial tool are described as follows:

The first streaming job queries data from Yahoo! or Google web services (user choice), cleans data for the following jobs, and writes data in a shared directory. The batch duration is set by the user, and only successive distinct quotes are stored. For instance, considering a 10s batch duration, quotes will be queried during 10s before being processed and written in a shared directory.

A second Spark streaming job keeps up to date the correlation matrix coefficients (“Spearman” correlations in MLlib, against the commonly used “Pearson” correlations not applicable, if we consider the price processes are lognormally distributed as the Black & Scholes model states) by streaming the processed stock quotes from the previous job.

A specific Scala job is in charge of drawing a heatmap from a matrix. Applied to the estimated correlations, it depicts the constantly updated correlation coefficients with associated colors.

CAC40 correlation heatmap

Relying on KMeans algorithm (MLlib), another Spark streaming job clusterizes correlations to build a well-diversified portfolio. KMeans clustering aims to partition a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Here, the drawing job is used for the clusterized correlations representation: companies with similar correlations (considering as many features/correlations as stocks, i.e. 40) are assigned to the same cluster.

CAC40 clusterized correlation heatmap

Perspectives

We could use resulting correlations to simulate the underlying stocks of an option, and moreover through Monte Carlo simulations, the delta greek to delta hedge it (i.e. find positions on the market to reduce the exposition of the underlying stocks variations). For sure, all embarrassingly parallel jobs can benefit from the mapReduce paradigm offered by Spark. In the automated trading context, it could be interesting to let ProActive automatically deploy such application involving specialized streaming spark jobs, and let ProActive dynamically orchestrate them: some streaming jobs would be specific to trading strategies, and other streaming jobs would be in charge of market data analysis. Proactive Workflow & Scheduling afford such dynamic deployment and orchestration, in a fully scalable way.

Thursday, June 18, 2015

Execute your parallel workflows through a simple REST API

Coming versions of ProActive Workflows & Scheduling will offer a feature that will make the integration between its API and your business software an even simpler task, specially if you care only about your business and its parallelization.

We observed from some of our clients the need to maintain a clear isolation between the workflow development process, and the workflow execution process. It is just that users that deal with the core of the business often care (almost exclusively) about the workflow execution and its results. They usually submit the same workflow again and again, simply changing their inputs.

Our scheduler has been adapted for such use cases. Now you can create the workflow once (see how easy it is with our ProActive Studio after you created a free account here), and execute it with parameters through an intuitive API. No need to have the workflow on you, no need to even know there is a workflow that parallelizes the execution.

How to use such API?

So simple… Use our REST API as follows to execute workflows available in the Studio templates.

Using bash and curl, first you need to log in:

# Log in and keep the session ID somewhere

$ sessionid=`curl -k -d "username=admin&password=admin" http://localhost:8080/rest/scheduler/login`

Using the session ID, you can now list all the available templates on the server (same can be done to obtain the list of private workflows, but using /rest/studio/workflows instead of /rest/studio/templates):

# List all available templates (the ones created with the ProActive Studio will be available too)

$ curl -X GET -H "sessionid:$sessionid" http://localhost:8080/rest/studio/templates/

RESPONSE:

[

{

"id": 2,

"metadata": "...",

"name": "Variable Propagation",

"xml": "..."

{

"id": 1,

"metadata": "...",

"name": "Pre-Post-Clean Scripts",

"xml": "..."

}

]

Now, let’s say we are interested in submitting our template 1, you can get more details about it by doing the following:

# Choose one template and get information about it
curl -X GET -H "sessionid:$sessionid" http://localhost:8080/rest/studio/templates/1

Or even get its XML content:

# Choose one template and get its xml content
curl -X GET -H "sessionid:$sessionid" http://localhost:8080/rest/studio/templates/1/xml

Now you can execute the workflow from the scheduler providing through the header “Link” the workflow URL we have been using:

# Submit the workflow to the scheduler
curl -X POST -H "Link: http://localhost:8080/rest/studio/workflows/1/content" -H "sessionid:$sessionid" http://localhost:8080/rest/scheduler/jobs/

And that is all!!!

I know, it is just so simple.

Hope you use it!

Wednesday, June 3, 2015

Using ProActive Scheduler from a Python application

For a collaborative project we are involved in, we had to integrate with a Python application. Fortunately this is made easy by the REST API of ProActive Scheduler.
We hope this piece of code will prove useful in integrating your Python application with ProActive!

In Python, it turns out the requests library is very to use to build HTTP requests. For instance here is how to retrieve the version of the Scheduler:

import requests

r = requests.get("http://try.activeeon.com/rest/scheduler/version")

print(r.text)

which outputs { "scheduler" : "6.1.0", "rest" : "6.1.0"}.

It is also very easy to manipulate JSON data with r.json() that returns a dictionary.

We put together a small project that should be considered more as an example rather than a full client for the Scheduler. It is available on Github:

example.py shows various interactions with the Scheduler API, logging in, submitting a job and waiting for it to finish.
scheduler.py contains the code to query the REST API

You can also take a look at our REST API documentation to see how it would be possible to implement more features in this small client.

Tuesday, May 19, 2015

Integrating services with Apache Camel

One of our clients came not long ago with a problem: a complex system with lots of information exchange between remote services… We simply did not know how to handle such situation in a neat and simple manner. We had to consider the different potential APIs and their protocols, what would be easier to integrate into existing blocks, clients implementations and their development status, maintenance… What was worse is that probably nobody would even understand the resulting architecture nor be able to maintain it easily. But after some head-scratching we found a lovely theoretical solution: messaging patterns, or, more precisely, Enterprise Integration Patterns (or EIPs for short).

Why are EIPs useful?

Usually enterprise environments are made of a bunch of services that must interact between them to provide another service, let’s say an integrated service. Thinking through this integration used to take lots of time and effort from engineers. But someone came up with an idea.

Yep, a great idea: well thought patterns to address these exact integration problems. EIP had just born.

EIPs are useful to avoid re-inventing the wheel by sticking to existing standards to produce better software in an simpler manner. Software that is more readable, more intuitive, more flexible, more documented.

More about EIP?

For EIP, information between different services has the shape of messages. If there are messages in the picture, there must be also producers and consumers. Along their way to the consumers, messages pass through different elements: queues, processors, routers, among others. These transform messages, route them, creates additional messages, etc. For instance this is how a route looks like.

This is a very simple example. There is a producer of messages, that will go through a translator element, and a router that will set their way to reach either Queue A or Queue B.

EIP establishes very intuitive patterns that allow interconnection of different services, from message producers until message consumers. Some popular known patterns are publish-subscribe, pipes and filters (or pipeline). There are some other lesser known patterns, for instance the very helpful Dead Letter Channel that specifies a way to treat messages that for some reason failed to be delivered.

Too theoretical, how do I proceed with my integration?

If you like making your own way through the jungle, you just need to read Enterprise Integration Patterns by Gregor Hohpe (2004) and implement whatever EIPs you need. Sticking to existing patterns is always a great idea for many reasons.

However, for those who like highways, there are is a very good framework I would like you to befriend. Camel: the reader. The reader: Camel.

How does Camel fit with my existing software? How could we use all this if our system already uses existent technologies? Adapting Camel to your scenario might not be as painful as you think.

What is really Camel?

Camel is an open source integration framework based on EIPs. Apart from providing the possibility to implement lots of EIP patterns, you can benefit from their more than hundred connector components, which makes it really easy for developers to integrate it to existing software.

Technologies supported by Camel?

More concretely, a component allows you to connect your EIP to specific endpoints technologies. For instance you use a local storage File component to generate one message per file in a given directory of your filesystem. It usually works two-ways for most of the components, which means that for instance using the same File component you can also dump messages into files in your filesystem.

Similarly, there are components for remote storage: FTP, FTPS, Dropbox and even S3 (Simple Storage Service) Amazon Services!

If planning to use queueing technologies, you’re not out of luck because Camel supports JMS and AMQP queues, and also AWS-SQS (Amazon Simple Queue Service).

Messages could be also generated from / dumped to items on different databases: SQL, Cassandra, Elasticsearch and CouchDB.

Some management connectors include JMX, SSH. Also there are some crazy connectors you might want to take a look at: Facebook (to access Facebook API), Docker (to communicate with Docker), Exec (for executing system commands).

What else?

Camel is an open source Apache project, developed in Java over Spring framework. Since its first commits in 2007, it is actively being improved in this repository by a growing community (now including me!). Don’t miss the chance to give it a try!

Happy integration!

Links