Friday, March 6, 2015

Workflows for Big Data

One of ActiveEon's most remarkable contributions to the French project DataScale is the ability to execute ProActive workflows on HPC platforms. Why are these workflows so interesting? They have lots of features! Some that come to mind:
  • Workflow data management mechanisms (move your files from one task to another without needing a shared file system)
  • Workflows are made of tasks, which can be native tasks (executing installed applications)
  • Tasks can also be implemented in OS-independent scripting languages such as Groovy, Ruby, Python, Java, R, JavaScript, and more to come...
  • Tasks support dependencies (don't execute Y until X has finished), replication (execute this task N times in parallel), loops (keep executing this task depending on a condition), and conditionals (execute this task, or that one, depending on a condition)
  • Error handling at the job and task levels, with different rescheduling policies (what to do if your task fails?)
  • Inter-task variable passing mechanisms (let tasks communicate with each other through variables)
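
To make a few of these ideas concrete (dependencies, retrying a failed task, and passing variables between tasks), here is a toy sketch in plain Python. It is not the ProActive API: `run_workflow`, the task names, and the `max_attempts` parameter are all hypothetical and only illustrate the concepts, assuming Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, max_attempts=2):
    """Toy workflow runner (hypothetical, not the ProActive API).

    tasks: dict mapping a task name to a callable that receives the
           shared variables dict
    dependencies: dict mapping a task name to the set of task names
                  that must finish before it starts
    max_attempts: how many times a failing task is tried before the
                  whole job fails
    """
    variables = {}  # inter-task variables, visible to every task
    order = []      # task names in the order they completed
    graph = {name: dependencies.get(name, set()) for name in tasks}
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name](variables)
                break  # task succeeded, stop retrying
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: fail the job
        order.append(name)
    return order, variables


attempts = {"sum": 0}  # counts how often the flaky task was started


def split(variables):
    variables["chunks"] = [[1, 2], [3, 4]]


def flaky_sum(variables):
    attempts["sum"] += 1
    if attempts["sum"] == 1:
        raise RuntimeError("simulated failure on the first attempt")
    variables["partial"] = [sum(chunk) for chunk in variables["chunks"]]


def merge(variables):
    variables["total"] = sum(variables["partial"])


order, variables = run_workflow(
    {"merge": merge, "sum": flaky_sum, "split": split},
    {"sum": {"split"}, "merge": {"sum"}},
)
print(order, variables["total"])
```

Here `sum` fails once, is retried, and only then does `merge` run and read the values that the earlier tasks left in the shared variables.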

By letting you execute these kinds of workflows, and with the help of predefined workflow templates, we can help you tackle your use case easily. For a more complete overview of our features, please try our product.

One example use case is shown in the following demo video (enable subtitles for an explanation). Here we show how ProActive Workflows & Scheduling can be used in the Big Data and HPC domains to:
1. Write any kind of workflow (involving tools like Hadoop, Unix command line tools, and even custom scripts in Groovy).
2. Execute those workflows on an HPC platform.
3. Follow the execution of those workflows (task output, logs, execution time, execution node, etc.).
4. Run a workflow that prepares TXT book files for processing, counts their words (using Hadoop), generates a report, and uploads that report to the cloud to make it public.
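
For readers who want a feel for the stages of that last workflow, here is a rough local stand-in in Python. It is only a sketch under stated assumptions: the demo uses Hadoop for the word count, whereas this uses `collections.Counter`; the function names and sample texts are made up for illustration; and the cloud upload step is left out.

```python
import re
from collections import Counter


def prepare(text):
    """Prepare a book for processing: lowercase it and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def word_count(books):
    """Count words across all books (a local stand-in for the Hadoop job)."""
    total = Counter()
    for text in books.values():
        total.update(prepare(text))
    return total


def make_report(counts, top=3):
    """Generate a plain-text report listing the most frequent words."""
    return "\n".join(f"{word}: {n}" for word, n in counts.most_common(top))


# Hypothetical sample input standing in for the TXT book files.
books = {
    "a.txt": "The cat and the hat.",
    "b.txt": "The end.",
}
counts = word_count(books)
print(make_report(counts, top=1))
```

In a real workflow each of these functions would be its own task, with the files moved from one task to the next by the workflow's data management mechanisms.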

Maybe we can also help you boost your productivity, with ProActive!
