Friday, March 6, 2015

ProActive and R: a Machine Learning Example





The ProActive Scheduler is open-source software for orchestrating, scaling and monitoring tasks across many hosts. It supports several languages, one of which is R, the environment for statistical computing and graphics. R is known for computationally intensive workloads, so you can write your R scripts on a laptop and execute them on other, more powerful machines.


Docker containers for portability and isolation

At Cloud Expo Europe in London you can see an exciting, heavily developed feature of the ProActive Scheduler: Docker container support. Running tasks in containerized form increases the isolation between tasks and provides self-defined environments in which to run them. Taken further, containers could even stand in for the tasks themselves, running your workload inside a container in your own environment. The possibilities are endless, and you do not have to worry about error recovery, network outages or other complications of running in distributed environments, because the ProActive software deals with them.

Machine Learning with ProActive and R: Local Setup

Below are a few steps showing how to install and run ProActive and, finally, execute an R script with the ProActive Scheduler. The steps were done on an Ubuntu operating system.

Requirement: Installing the R Environment and rJava

Install the R environment and rJava by typing:

    # sudo apt-get install r-base r-cran-rjava
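
A quick, optional sanity check that both pieces are in place (rJava is the R package provided by r-cran-rjava):

    # confirm R is installed and that the rJava package loads without error
    R --version
    Rscript -e 'library(rJava)'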


Download ProActive

  1. Create an account on www.activeeon.com
  2. Download the current ProActive Workflows & Scheduling
  3. Unzip ProActiveWorkflowsScheduling-linux-x64-6.1.0.zip
  4. Download the ProActive-R-Connector (par-script-6.1.0.zip)
  5. Unzip par-script-6.1.0.zip into the ‘ProActiveWorkflowsScheduling-linux-x64-6.1.0/addons’ folder

That's it, you have just installed ProActive with R support.
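
For reference, a minimal sketch of what steps 3 to 5 might look like on the command line, assuming both archives were downloaded to the current directory:

    # unpack the ProActive distribution
    unzip ProActiveWorkflowsScheduling-linux-x64-6.1.0.zip

    # unpack the R connector into the addons folder
    unzip par-script-6.1.0.zip -d ProActiveWorkflowsScheduling-linux-x64-6.1.0/addons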

Start ProActive Server

Execute:

# ./ProActiveWorkflowsScheduling-linux-x64-6.1.0/bin/proactive-server

The default settings will run the ProActive Scheduler together with 4 local nodes.

Note: ProActiveWorkflowsScheduling-linux-x64-6.1.0 is the ProActive home directory; it might be named differently if you downloaded a newer version.

Wait until you see “Get started at”, followed by the link to access the web interface.


Start the ProActive Studio

The interface shows three options; the leftmost orange circle is a link to the ProActive Studio, which is used to create and execute workflows. Click on that circle to open the Studio and log in with username admin and password admin.

Create an R task

After creating and opening a workflow, the interface shows a 'Tasks' drop-down menu; select 'Language R' to create an R task.



Add your R code

Add your R code; an adapted example can be downloaded from http://computationalfinance.lsi.upc.edu/.
Add the code under the "Execution" menu, which appears after selecting the R_Task.



Note: When R is executed on another machine, that machine must have all the necessary packages installed and loaded. Ensure this by installing the packages in advance; it can also be done within a script by specifying the package and the CRAN mirror, as sketched below.
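
A minimal sketch of preparing a node from the shell; the package name my.package is only a placeholder for whatever your script actually needs:

    # install a package non-interactively, pinning a CRAN mirror
    Rscript -e 'install.packages("my.package", repos = "http://cran.r-project.org")'

The same install.packages() call can also sit at the top of the R task itself, so a node installs any missing package before the rest of the script runs.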


Add datasets to R_Task

The script loads a dataset, the SP500_Shiller dataset, which you can download here. The R script will be sent to one ProActive Node, and to ensure that the node has the data we need to specify the data dependency inside the 'Data Management' settings of the R_Task. Specify SP500_Shiller.csv as an input file from user-space. The file must be copied to ProActiveWorkflowsScheduling-linux-x64-6.1.0/data/defaultuser/admin, which is the user-space of the admin user.
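
Assuming the CSV was downloaded to the current working directory, the copy is a one-liner:

    # place the dataset in the admin user-space so the node can fetch it
    cp SP500_Shiller.csv ProActiveWorkflowsScheduling-linux-x64-6.1.0/data/defaultuser/admin/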



Specify output file inside Scheduler

The R script will output an image, 'ml-result.png'. To see the result we need to tell the task to copy it back into our user-space after the task finishes. That is done by adding ml-result.png as an output file to user-space.



Access result

To see the result, open ProActiveWorkflowsScheduling-linux-x64-6.1.0/data/defaultuser/admin – the user-space of the admin user – which contains 'ml-result.png' once the R script has finished.
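
From a terminal, for example:

    # open the generated plot with the default image viewer
    xdg-open ProActiveWorkflowsScheduling-linux-x64-6.1.0/data/defaultuser/admin/ml-result.png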






Workflows for Big Data


One of ActiveEon's most remarkable contributions to the French project DataScale is the possibility to execute ProActive workflows on HPC platforms. Why are these workflows so interesting? They have lots of features! Some that come to mind:
  • Workflow data management mechanisms (take and bring your files from one task to another without the need for a shared file system)
  • Our workflows are made of tasks, which can be native tasks (execute installed applications)
  • Tasks can also be implemented in OS-independent scripting languages such as Groovy, Ruby, Python, Java, R, JavaScript, and more to come...
  • Tasks support dependencies (don't execute Y unless X has finished), replication (execute this task N times in parallel), loops (keep executing this task while a condition holds) and conditionals (execute this task, or that one, depending on a condition)
  • Error handling at job and task levels, with different re-scheduling policies (what to do if your task fails?)
  • Inter-task variable passing mechanisms (let tasks communicate with each other through variables)

By allowing you to execute this kind of workflow, with the help of predefined workflow templates, your use case can be tackled easily. For a more complete overview of our features, please try our product at try.activeeon.com.

One example of a use case is presented in the following demo video (enable subtitles for an explanation). Here we show how ProActive Workflows & Scheduling can be used in the Big Data and HPC domains to:
1. Write any kind of workflow (involving tools like Hadoop, Unix command-line tools, and even custom scripts in Groovy).
2. Execute those workflows on an HPC platform.
3. Follow the execution of those workflows (task output, logs, execution time, execution node, etc.).
4. Have a workflow that prepares TXT book files for processing, word-counts them (using Hadoop), generates a report, and uploads that report to the cloud to make it public.



Maybe we can also help you boost your productivity, with ProActive!

Monday, March 2, 2015

Slurm, Big Data, Big Files on Lustre Parallel Distributed File System, and ActiveEon's Workflows

See the original post at datascale.org

This blog post aims to describe the Big Data solutions we have decided to use for the French project DataScale and, more importantly, the reasons why we have chosen them.
Let’s first walk through ActiveEon and its participation in DataScale.

ActiveEon & DataScale

ActiveEon has been in the market since 2007, helping customers optimize the use of their own compute infrastructures.
We often provide solutions to problems like under-exploitation of compute resources, business workflows that are too slow and could be enormously accelerated, or teams spending too much time on infrastructure management rather than on their own business. We do that. But lately we have been hearing more and more about the same problem: Big Data being processed on regular HPC platforms. Research teams with growing amounts of data, plus a platform that needs to somehow evolve to handle it better. So we decided to join the DataScale project to make our product capable of providing answers to those questions.
But not so fast. When you face this situation, data management, hierarchical storage, fast distributed file systems, efficient job schedulers, flexible and research-oriented workflow engines, isolation of your infrastructure and security are just some of the concepts that will come to your mind, especially if you plan to make your platform evolve to satisfy most of those requirements properly. But do not panic. Please do not. We have a good solution (as we always do). In this article we will simply explain why DataScale and its tools may be your guide towards the light.
You start from a regular HPC cluster: several hosts, hundreds of cores, good network connectivity, lots of memory. You probably use your own scripts to launch processes, and also have some sort of mechanism for data synchronization. In the best case you have your own native scheduler that eases your life a bit. Believe it, that is what we encounter at many of our customers. For people who now want to go Big Data, here are some tips.

What file system?

Big Data requires big throughput and scalability, so you had better think about Lustre. Lustre is a parallel distributed file system used in more than 50% of the TOP100 supercomputers. That already seems to us like a good argument. But there is more.
Clients benefit from its standard POSIX semantics, so you will be able to work with it on the client side pretty much like you would with EXT4, NFS and other familiar file systems. That means less time learning new things for people who just want to use it. But why not use my business application's specific file system? Although that is a valid solution, we consider that most teams exploit their infrastructure with several business applications rather than just one, so sticking to a single file system that is incompatible with your other applications would probably have an important impact on your work process one way or another. For instance, imagine data created in one specific file system, let’s say HDFS, that needs to be read by a different application incompatible with HDFS. Before proceeding with the processing we would have to migrate this data to a different file system, let’s say NFS, so that it can be used by the tool that follows in the workflow… Difficult and, in the end, time consuming. Why do that if performance can at least be kept the same? There are probably better ways: just use Lustre, and configure your tools to use Lustre, even as a regular POSIX file system. Make all your Big Data applications work on it as much as possible and exploit its great performance. There are small tweaks that can be applied to your application so it behaves better with Lustre; one simple tip: use only big files.
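
One concrete example of such a tweak, assuming the lfs client tools are available and that /mnt/lustre/bigdata is a hypothetical directory on your Lustre mount: stripe the directory across all available OSTs so that large files are spread over several storage targets.

    # stripe new files in this directory over all OSTs, with a 4 MB stripe size
    lfs setstripe -c -1 -S 4M /mnt/lustre/bigdata

    # check the resulting layout
    lfs getstripe /mnt/lustre/bigdata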
For the sake of completeness, Lustre also offers HSM (Hierarchical Storage Management), which works as a cache system, putting more frequently accessed data on faster levels of storage while less frequently accessed data remains on slower levels. I will not go into details, because my colleagues from Bull will surely do so in a coming blog post.

What scheduler?

Big platforms require resource management. Booking nodes, complex allocation, scheduling algorithms. As we do not like reinventing the wheel, we use the scheduler that more than 50% of the TOP500 supercomputers use: SLURM.
Free, open-source, Linux-native, very customizable and efficient, it is definitely a must-have in your HPC cluster. It can handle up to around 1,000 job submissions per second. A job is a shell script with annotations that are interpreted by SLURM at submission time. The annotations are very simple, so if you are familiar with any shell scripting language you can quickly learn how to write SLURM jobs; a minimal example is sketched below. With a distributed file system such as Lustre, SLURM becomes a very powerful and simple tool, usable by almost anyone.
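
A minimal SLURM job sketch (the job name, resource numbers and the hostname command are just placeholders for whatever your workload actually needs):

    #!/bin/bash
    #SBATCH --job-name=demo-job        # name shown in the queue
    #SBATCH --ntasks=4                 # number of tasks to allocate
    #SBATCH --time=00:05:00            # wall-clock limit
    #SBATCH --output=demo-%j.out       # %j is replaced by the job id

    # run the same command on every allocated task
    srun hostname

Submit it with 'sbatch demo-job.sh' and watch it in the queue with 'squeue'.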

Why a Workflow Engine?

A workflow engine adds additional flexibility to your platform, especially if it can be integrated with SLURM. This is the case with ProActive Workflows & Scheduling. This beauty brings some extra features to the infrastructure: support for flexible multi-task ProActive workflows, task dependencies, a Web Studio for easy creation of ProActive workflows, cron-based job submission, control blocks like replication and loops, dataspaces, job templates for managing data in cloud storage services (including support for more than 8 different file systems), and job templates for interacting with SLURM and Hadoop, among others.
If multiple languages are what you are looking for, you should know that ProActive tasks can be implemented in several languages: Java, JavaScript, Groovy, Ruby, Python, Bash, Cmd or R. You can also execute native Windows/Mac OS/Linux processes.
There is a Node Source integrated with SLURM: in other words, if ProActive requires compute resources, they can be taken from the SLURM pool and added to ProActive Workflows & Scheduling so that ProActive workflows are executed on SLURM nodes. After the execution of such workflows, the SLURM nodes are released.
ProActive also offers monitoring of the involved resources and the possibility to extend the infrastructure using private clouds (such as OpenStack or VMware) and public clouds (such as Numergy or Windows Azure).
Last but not least, ProActive provides a flexible mechanism for centralized authentication. It means that after loading the user credentials (a procedure done only once) through a simple initial user/password login, ProActive workflows execute with no further password requests, no matter which services the workflow invokes. Imagine your workflow accessing cloud storage accounts, executing business applications under given accounts, changing file permissions using specific credentials for Linux accounts, etc. All user credentials are safely stored, once, in the ProActive Third-Party Credential Store.
The execution of any workflow can be requested via a simple REST API call, making it easy to trigger your data processing from any cloud service.

Databases?

Armadillo rocks when it comes to seismic use cases. ArmadilloDB has been optimized to work over the Lustre file system and has added support for correlation operations over seismic Big Data. But that is not my topic; I will let them explain it better in a different blog post.

What does it give?

Having put every piece in its place, we have an interesting result: a platform that allows you to bring data from outside the cluster (several cloud storage services are supported) using intelligent workflows, process it via SLURM, ProActive tasks (implementable in more than 8 languages) or external processes such as ArmadilloDB or Hadoop, and move it to a convenient place. But to give you a clearer example of what your work with a DataScale platform would look like, I will share a simple video (enable subtitles to better understand):


In this video the user executes a simple SLURM job on data available on a DataScale-powered cluster. The results are then put back on a cloud storage server. Everything is done through a command-line client that makes REST calls to the ProActive Workflow Catalog server, a module of the ProActive Workflows & Scheduling product. For now we will not show performance results; we will do that in coming blog posts.
Hope you enjoyed!