Tuesday, May 19, 2015

Integrating services with Apache Camel

One of our clients came to us not long ago with a problem: a complex system with lots of information exchanged between remote services… At first we simply did not know how to handle the situation in a neat and simple manner. We had to consider the different candidate APIs and their protocols, what would be easiest to integrate with existing components, the client implementations and their development status, maintenance… Worse still, probably nobody would even understand the resulting architecture, let alone maintain it easily. But after some head-scratching we found a lovely theoretical solution: messaging patterns or, more precisely, Enterprise Integration Patterns (EIPs for short).

Why are EIPs useful?

Enterprise environments are usually made of a bunch of services that must interact with each other to provide another, integrated service. Thinking through such an integration used to take engineers a lot of time and effort. But someone came up with an idea.
Yep, a great idea: well-thought-out patterns that address these exact integration problems. EIPs were born.
EIPs help you avoid reinventing the wheel: by sticking to existing standards you produce better software in a simpler manner. Software that is more readable, more intuitive, more flexible, better documented.

More about EIP?

In EIP terms, information exchanged between services takes the shape of messages. If there are messages in the picture, there must also be producers and consumers. On their way to the consumers, messages pass through different elements: queues, processors, routers, among others. These transform messages, route them, create additional messages, etc. For instance, this is what a route looks like.



This is a very simple example. There is a producer of messages, which go through a translator element, and then a router that sends them on to either Queue A or Queue B.
EIPs establish very intuitive patterns that allow the interconnection of different services, all the way from message producers to message consumers. Some well-known patterns are publish-subscribe and pipes and filters (or pipeline). There are other, lesser-known patterns too, for instance the very helpful Dead Letter Channel, which specifies a way to treat messages that for some reason failed to be delivered.
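To make this concrete, here is a rough sketch of that same route written in the Java DSL of Camel, the framework introduced below. The endpoint URIs, the "type" header and the queue names are all made up for the illustration, and it assumes camel-core plus a JMS-capable component on the classpath; treat it as a sketch, not a drop-in implementation.

```java
import org.apache.camel.builder.RouteBuilder;

// Sketch of the diagram: producer -> message translator -> content-based router.
// All endpoint URIs and the "type" header are hypothetical examples.
public class OrderRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Dead Letter Channel: undeliverable messages end up in this queue
        errorHandler(deadLetterChannel("jms:queue:deadLetters"));

        from("jms:queue:incomingOrders")               // messages from the producer
            .transform(body().convertTo(String.class)) // a trivial message translator
            .choice()                                  // the router element
                .when(header("type").isEqualTo("A"))
                    .to("jms:queue:queueA")            // Queue A
                .otherwise()
                    .to("jms:queue:queueB");           // Queue B
    }
}
```

Register the RouteBuilder with a CamelContext and start it; Camel then moves the messages along the route for you.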

Too theoretical, how do I proceed with my integration?

If you like making your own way through the jungle, you just need to read Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf (2004) and implement whatever EIPs you need. Sticking to existing patterns is always a great idea, for many reasons.
However, for those who prefer highways, there is a very good framework I would like you to befriend. Camel: the reader. The reader: Camel.
How does Camel fit with your existing software? How can you use all this if your system is already built on other technologies? Adapting Camel to your scenario might not be as painful as you think.

What is really Camel?

Camel is an open-source integration framework based on EIPs. Apart from providing implementations of lots of EIPs, it offers more than a hundred connector components, which make it really easy for developers to integrate it with existing software.

Technologies supported by Camel?

More concretely, a component connects your EIP routes to specific endpoint technologies. For instance, you can use the local storage File component to generate one message per file in a given directory of your filesystem. Most components work both ways, which means that the same File component can also dump messages into files on your filesystem.
Similarly, there are components for remote storage: FTP, FTPS, Dropbox and even Amazon S3 (Simple Storage Service)!
If you plan to use queueing technologies, you are in luck: Camel supports JMS and AMQP queues, as well as Amazon SQS (Simple Queue Service).
Messages can also be generated from, or dumped to, items in different databases: SQL, Cassandra, Elasticsearch and CouchDB.
Management connectors include JMX and SSH. There are also some crazy connectors you might want to take a look at: Facebook (to access the Facebook API), Docker (to communicate with the Docker daemon) and Exec (to execute system commands).
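As a quick sketch of that two-way behavior (the directory names here are invented, and camel-core is assumed on the classpath):

```java
import org.apache.camel.builder.RouteBuilder;

// One message is created per file appearing in data/inbox; each message
// is then written back out as a file in data/outbox.
public class FileCopyRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:data/inbox?noop=true")   // noop=true: don't move/delete originals
            .to("file:data/outbox");
    }
}
```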

What else?

Camel is an open-source Apache project, written in Java, and it integrates nicely with the Spring framework. Since its first commits in 2007 it has been actively improved in this repository by a growing community (now including me!). Don’t miss the chance to give it a try!

Happy integration!

Wednesday, April 29, 2015

ProActive and Docker: schedule and orchestrate Docker with ProActive and Docker Compose + variable substitution

ProActive workflows and orchestration

The ProActive Scheduler is open-source software to orchestrate tasks on your hybrid cloud infrastructure. With an easy-to-use GUI and an easy setup on master and nodes (just execute one sh file or, why not, get it in containers directly), it fits very well into the Docker ecosystem, which I find very user friendly.

Getting Docker Compose support into ProActive

The ProActive software uses the concept of JSR 223 script engines. I used that fact to implement a JSR 223 compatible script engine that executes Docker Compose, and it now drives the Docker Compose support. To add Docker Compose support, simply add one jar file to your ProActive installation. You can find the Docker Compose script engine here; but don't worry, it is already inside the containers we will use in this walkthrough.
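As a side note for the curious, JSR 223 is the standard Java scripting API: any engine registered on the classpath can be discovered and driven through javax.script, which is exactly what makes a single drop-in jar enough. Here is a minimal probe, assuming nothing beyond the JDK (the engine name "docker-compose" is my guess for what the ProActive engine registers under; on a plain JVM the lookup simply returns null):

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

public class EngineProbe {
    public static void main(String[] args) {
        ScriptEngineManager manager = new ScriptEngineManager();

        // Every JSR 223 engine on the classpath registers itself here
        for (ScriptEngineFactory factory : manager.getEngineFactories()) {
            System.out.println(factory.getEngineName() + " handles " + factory.getNames());
        }

        // Look an engine up by name; "docker-compose" is an assumed name for
        // the ProActive Docker Compose engine, not a JDK built-in.
        ScriptEngine engine = manager.getEngineByName("docker-compose");
        System.out.println(engine == null
                ? "docker-compose engine not on this classpath"
                : "docker-compose engine found: " + engine.getClass().getName());
    }
}
```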

Running ProActive

You need to have Docker 1.6 and Docker Compose installed; if not, follow the installation guides for Docker and Docker Compose.

Run ProActive with Docker Compose! I prepared a yaml file (download it here):
proactiveScheduler:
  image: tobwiens/proactive-scheduler
  ports:
    - "8080:8080"
  expose:
    - "64738"
proactiveNode:
  image: tobwiens/proactive-node
  command: -r pnp://proactiveScheduler:64738
  links:
    - proactiveScheduler
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock

The ‘proactiveScheduler’ service starts the ProActive master, whose web interface will be accessible on port 8080. Additionally it exposes port 64738, to which the ProActive node will connect. The scheduler manages the nodes and keeps track of the tasks.
The node will execute and monitor tasks, so we need at least one node to execute our Docker Compose tasks.
The ‘proactiveNode’ service is linked to proactiveScheduler and connects with the command ‘-r pnp://proactiveScheduler:64738’; check out the node's and the scheduler's Dockerfiles.
The docker.sock is mounted inside the container so that tasks can start containers on your local Docker daemon.
Now: execute the yaml file with docker-compose up!

Executing a Docker Compose script with the ProActive GUI

After executing the yaml file, wait until the web interface has started, then go to http://127.0.0.1:8080/studio and log in with the standard account (admin/admin). Click on create to create a new workflow:

After creating your first workflow, click on open (green button):


Now that the workflow is open, import the Docker Compose task by clicking import; download it here.
Note: the Docker Compose script engine is not yet selectable in the GUI (this will change soon), so for now the xml files have to be edited in a text editor and imported.
More details: the GUI validates the script engine as soon as the script is edited, and that validation removes the language="docker-compose" attribute from the xml file.

After importing you will see a task which contains a Docker Compose script: 
ubuntuEcho:
  image: dockerfile/ubuntu
  command: /bin/bash -c 'sleep 5 && echo It worked!!!!'
It executes the following command (in the dockerfile/ubuntu image), which will print "It worked!!!!" on the screen:
sleep 5 && echo It worked!!!!
Note: I added the ‘sleep 5’ because the echo command returned so fast that Docker Compose tried to attach to a container that had already stopped, which caused it to wait forever.


After clicking execute, go to http://127.0.0.1:8080/scheduler/ to see the progress of the Docker Compose script. Select the submitted job and you will see:


After the job is finished you can access the output by clicking the ‘Fetch output’ button in the Output tab:
Not glamorous, but we can see the ‘It worked!!!’ in the line ‘[36mubuntuEcho_1 | [0mIt worked!!!’ (the [36m and [0m fragments are just ANSI color codes).

Variables in Docker Compose workflows 


ProActive can substitute workflow variables into the Docker Compose script before it runs. The new yaml file inside the task is:
ubuntuEcho:
  image: dockerfile/ubuntu
  command: /bin/bash -c 'sleep 5 && echo "$output_variable" '
The output of the echo has been replaced with a variable: $output_variable.

  1. Create a new workflow
  2. Import the xml file (download it here)
You should see a variable called output_variable in the 'Job Variables' tab:


Execute the workflow and check the output in the http://127.0.0.1:8080/scheduler/ interface. The variable has been replaced, and the echo will now output the content of the variable.

Friday, March 6, 2015

ProActive and R: a Machine Learning Example





The ProActive Scheduler is open-source software to orchestrate, scale and monitor tasks across many hosts. It supports several languages, one of them being R, the environment for statistical computing and graphics. R is known for its computationally intensive functionality, so you can write your R scripts on a laptop and execute them on different, more powerful machines.


Docker container for portability and isolation

At Cloud Expo Europe in London you can see an exciting and heavily developed feature of the ProActive Scheduler: Docker container support. Running tasks in containerized form has the advantage of increasing isolation between tasks and providing self-defined environments in which to run them. Taken further, a container can even replace the task itself and run your workload inside it. The possibilities are endless, and you do not have to worry about error recovery, network outages or other complications of running in distributed environments, because the ProActive software deals with them.

Machine Learning with ProActive and R: Local Setup

What follows are a few steps showing how to install and run ProActive and finally execute an R script with the ProActive Scheduler. The steps were performed on an Ubuntu operating system.

Requirement: Installing the R Environment and rJava

Install the R Environment and RJava by typing:

    # sudo apt-get install r-base r-cran-rjava


Download ProActive

  1. Create an account on www.activeeon.com
  2. Download the current ProActive Workflows & Scheduling
  3. Unzip ProActiveWorkflowsScheduling-linux-x64-6.1.0.zip
  4. Download the ProActive-R-Connector (par-script-6.1.0.zip)
  5. Unzip par-script-6.1.0.zip into the ‘ProActiveWorkflowsScheduling-linux-x64-6.1.0/addons’ folder

Ready! You have just installed ProActive with R support.

Start ProActive Server

Execute:

# ./ProActiveWorkflowsScheduling-linux-x64-6.1.0/bin/proactive-server

The standard settings will run the ProActive Scheduler and 4 local nodes.

Note: ProActiveWorkflowsScheduling-linux-x64-6.1.0 is the ProActive home directory; it might be named differently if you downloaded a newer version.

Wait until you see “Get started at” followed by the link to access the web interface.


Start the ProActive Studio

The interface will show three circles; the leftmost orange circle is a link to the ProActive Studio, which is used to create workflows and execute them. Click on it to open the Studio. Log in with user admin and password admin.

Create an R task

After creating a workflow and opening it, the interface will show a 'Tasks' drop-down menu; select 'Language R' to create an R task.



Add your R code

Add your R code; you can download an altered example from http://computationalfinance.lsi.upc.edu/.
Add your code under the "Execution" menu, which appears after selecting the R_Task.



Note: when R is executed on another machine, that machine must have all necessary packages installed and loaded. Ensure this by installing packages in advance; it can be done within the script itself by specifying the library and the mirror.


Add datasets to R_Task

The script loads a dataset, the SP500_Shiller dataset, which you can download here. The R script will be sent to one ProActive Node, and to make sure that node has the data, we need to specify the data dependency inside the 'Data Management' settings of the R_Task. Specify SP500_Shiller.csv as an input file from user space. The file must be copied to ProActiveWorkflowsScheduling-linux-x64-6.1.0/data/defaultuser/admin, which is the user space of the admin user.



Specify output file inside Scheduler

The R script will output an image, 'ml-result.png'. To see the result, we need to tell the task to copy it into our user space after the task finishes. That is done by adding ml-result.png as an output file to user space.



Access result

To see the result, open ProActiveWorkflowsScheduling-linux-x64-6.1.0/data/defaultuser/admin (the user space of the admin user), which will contain 'ml-result.png' after the R script has finished.






Workflows for Big Data


One of ActiveEon's most remarkable contributions to the French project DataScale is the possibility to execute ProActive workflows on HPC platforms. Why are these workflows so interesting? They have lots of features! Some that come to my mind:
  • Workflow data management mechanisms (take and bring your files from one task to another without the need for a shared file system)
  • Workflows are made of tasks, which can be native tasks (executing installed applications)
  • Tasks can also be implemented in OS-independent scripting languages such as Groovy, Ruby, Python, Java, R, JavaScript, and more to come...
  • Tasks support dependencies (don't execute Y unless X has finished), replication (execute this task N times in parallel), loops (keep executing this task while a condition holds) and conditionals (execute this task, or that one, depending on a condition)
  • Error handling at job and task level, with different re-scheduling policies (what to do if your task fails?)
  • Inter-task variable passing mechanisms (let tasks communicate with each other through variables)

By executing these kinds of workflows, with the help of predefined workflow templates, your use case can be tackled easily. For a more complete overview of our features, please try our product at try.activeeon.com.

One example of a use case is presented in the following demo video (enable subtitles for an explanation). Here we show how ProActive Workflows & Scheduling can be used in the Big Data and HPC domains to:
1. Write any kind of workflow (involving tools like Hadoop, Unix command-line tools, and even custom scripts in Groovy).
2. Execute those workflows on an HPC platform.
3. Follow the execution of those workflows (task output, logs, execution time, execution node, etc.).
4. Have a workflow that prepares TXT book files for processing, word-counts them (using Hadoop), generates a report, and uploads the report to the cloud to make it public.



Maybe we can also help you boost your productivity, with ProActive!

Monday, March 2, 2015

Slurm, Big Data, Big Files on Lustre Parallel Distributed File System, and ActiveEon's Workflows

See the original post at datascale.org

This blog post describes the Big Data solutions we have decided to use for the French project DataScale and, more importantly, the reasons why we chose them.
Let’s first have a walk-through of ActiveEon and its participation in DataScale.

ActiveEon & DataScale

ActiveEon has been in the market since 2007, helping customers optimize the use of their own compute infrastructures.
We often provide solutions to problems like under-exploitation of compute resources, business workflows that are too slow and could be enormously accelerated, or teams spending too much time on infrastructure management rather than on their own business. We do that. But lately we have been hearing more and more about the same problem: Big Data being processed on regular HPC platforms. Research teams with growing amounts of data, plus a platform that needs to somehow evolve to handle them better. So we decided to join the DataScale project to make our product capable of answering those questions.
But not so fast. When you are in this situation, data management, hierarchical storage, fast distributed file systems, efficient job schedulers, a flexible and research-oriented workflow engine, isolation of your infrastructure and security are just some of the concepts that must come to mind, especially if you plan to make your platform evolve to satisfy most of the requirements properly. But do not panic. Please do not. We have a good solution (as we always do). In this article we will simply explain why DataScale and its tools may be your guide towards the light.
You start from a regular HPC cluster: several hosts, hundreds of cores, good network connectivity, lots of memory. You probably use your own scripts to launch processes and have some sort of mechanism for data synchronization. In the best case you have a native scheduler that eases your life a bit. Believe it: that is what we encounter at many of our customers. For people who now want to go Big Data, here are some tips.

What file system?

Big Data requires big throughput and scalability, so you had better think about Lustre. Lustre is a parallel distributed file system used in more than 50% of the TOP100 supercomputers. That already seems like a good argument to us. But there is more.
Clients benefit from its standard POSIX semantics, so you will be able to play with it on the client side pretty much like you would with EXT4, NFS and other familiar file systems. That means less time learning new stuff for people who just want to use it. But why not use my business application's specific file system? Although that is a valid option, we consider that most teams exploit their infrastructure with several business applications rather than just one, so sticking to one file system that is incompatible with your other applications would probably have an important impact on your work process, one way or another. For instance, imagine data created in one specific file system, say HDFS, that needs to be read by a different application incompatible with HDFS. Before proceeding with the processing, we would have to migrate this data to a different file system, say NFS, so that it can be used by the tool that comes next in the workflow… Difficult and, in the end, time consuming. Why do that if performance can at least be kept the same? There are probably better ways: just use Lustre, configure your tools to use Lustre, and even use it as a regular POSIX file system. Make all your Big Data applications work on it as much as possible, and exploit its great performance. There are small tweaks you can make so your app behaves better with Lustre; one simple tip: use only big files.
For the sake of completeness, Lustre also offers HSM (Hierarchical Storage Management), which works like a cache system, putting more regularly accessed data on faster tiers of storage while less accessed data remains on slower tiers. I will not go into details, because my colleagues from Bull will surely do so in a coming blog post.

What scheduler?

Big platforms require resource management: booking nodes, complex allocation, scheduling algorithms. As we do not like reinventing the wheel, we use the scheduler that more than 50% of the TOP500 supercomputers use: SLURM.
Free, open-source, Linux-native, very customizable and efficient, it is definitely a must-have in your HPC cluster. It can handle up to around 1000 job submissions per second. A job consists of a shell script with annotations that SLURM interprets at submission time. The annotations are very simple, so if you are familiar with any shell scripting language you can quickly learn how to write SLURM jobs. Combined with a distributed file system such as Lustre, SLURM becomes a very powerful and simple tool, usable by almost anyone.

Why a Workflow Engine?

A flexible workflow engine adds additional flexibility to your platform, especially if it can be integrated with SLURM. That is the case with ProActive Workflows & Scheduling. This beauty brings some extra features to the infrastructure: flexible multi-task ProActive workflows, task dependencies, a Web Studio for easy creation of ProActive workflows, cron-based job submission, control blocks like replication and loops, dataspaces, job templates for managing data in cloud storage services (with support for more than 8 different file systems), and job templates for interacting with SLURM and Hadoop, among others.
If multiple languages are what you are looking for, ProActive tasks can be implemented in several languages: Java, JavaScript, Groovy, Ruby, Python, Bash, Cmd or R. You can also execute native Windows, Mac OS and Linux processes.
There is a Node Source integrated with SLURM: in other words, if ProActive requires compute resources, they can be taken from the SLURM pool and added to ProActive Workflows & Scheduling so that ProActive workflows are executed on SLURM nodes. After the workflows finish, the SLURM nodes are released.
ProActive also offers monitoring of the involved resources and the possibility to extend the infrastructure using private clouds (such as OpenStack or VMware) and public clouds (such as Numergy or Windows Azure).
Last but not least, ProActive provides a flexible mechanism for centralized authentication. After loading the user's credentials (a procedure done only once) through a simple user/password login, ProActive workflows execute with no further password requests, no matter which services the workflow invokes. Imagine your workflow accessing cloud storage accounts, executing business applications under given accounts, changing file permissions using specific credentials for Linux accounts, etc. All user credentials are safely stored, once, in the ProActive Third-Party Credential Store.
Executing any workflow is then possible via a simple REST API call, making it easy to trigger your data processing from any cloud service.

Databases?

Armadillo rocks when it comes to seismic use cases. ArmadilloDB has been optimized to work over the Lustre FS, with added support for performing correlation operations over seismic Big Data. But that is not my topic; I will let them explain it better in a different blog post.

What does it give?

With every piece in its place, we get an interesting result: a platform that allows you to bring data from outside the cluster (several cloud storage services are supported) using intelligent workflows; process it via SLURM, ProActive tasks (implementable in more than 8 languages), or external processes such as ArmadilloDB or Hadoop; and then move it to a convenient place. To give you a clearer example of what working with a DataScale platform looks like, I will share a simple video (enable subtitles to better understand):


In this video the user executes a simple SLURM job on data available on a DataScale-powered cluster. The results are then put back on a cloud storage server. Everything is done through a command-line client that makes REST calls to the ProActive Workflow Catalog server, a module of the ProActive Workflows & Scheduling product. We will not show performance results for now; those will come in future blog posts.
Hope you enjoyed it!