This blog post describes the Big Data solutions we chose for the French DataScale project and, more importantly, why we chose them.
Let’s start with a quick overview of ActiveEon and its role in DataScale.
ActiveEon & DataScale
ActiveEon has been in the market since 2007, helping customers optimize the use of their compute infrastructures.
We often solve problems like under-exploited compute resources, business workflows that are too slow and could be dramatically accelerated, or teams spending too much time on infrastructure management rather than on their actual business. That is what we do. Lately, however, we have been hearing about the same problem more and more often: Big Data being processed on regular HPC platforms. Research teams face growing amounts of data, and their platforms need to somehow evolve to handle it better. So we joined the DataScale project to make our product capable of answering those needs.
But not so fast. In this situation, data management, hierarchical storage, fast distributed file systems, efficient job schedulers, a flexible research-oriented workflow engine, infrastructure isolation, and security are just some of the concepts that should come to mind, especially if you plan to evolve your platform to satisfy most of these requirements properly. But do not panic. Please do not. We have a good solution (as we always do). In this article we will explain how DataScale and its tools can guide you toward the light.
You start from a regular HPC cluster: several hosts, hundreds of cores, good network connectivity, lots of memory. You probably use your own scripts to launch processes, and you have some sort of mechanism for data synchronization. In the best case you have a native scheduler that eases your life a bit. Believe it or not, that is what we find at many of our customers. For those who now want to go Big Data, here are some tips.
What file system?
Big Data requires big throughput and scalability, so you should think about Lustre. Lustre is a parallel distributed file system used in more than half of the TOP100 supercomputers. That already seems like a good argument to us, but there is more.
Clients benefit from its standard POSIX semantics, so on the client side you can use it much like EXT4, NFS, and other familiar file systems. That means less time learning new things for people who just want to use it.

But why not use my business application’s specific file system? Although that is a valid option, we find that most teams run several business applications on their infrastructure rather than just one. Sticking to a single file system that is incompatible with your other applications would likely impact your work process one way or another. For instance, imagine data created on one specific file system, say HDFS, that needs to be read by an application incompatible with HDFS. Before processing can continue, the data has to be migrated to a different file system, say NFS, so that the next tool in the workflow can use it. That is cumbersome and, in the end, time consuming. Why do that if performance can at least be kept the same? There is a better way: use Lustre, configure your tools to use Lustre, and even treat it as a regular POSIX file system. Make all your Big Data applications work on it as much as possible and exploit its great performance. There are small tweaks that make an application behave better with Lustre; one simple tip: use only big files.
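As a concrete illustration of the “big files” tip, here is a sketch (file names and the Lustre mount point are placeholders) that packs many small inputs into a single archive before it goes to the shared mount; many tiny files hammer Lustre’s metadata server, while one large sequential file plays to its strengths:

```shell
# Placeholder inputs: imagine thousands of small result files.
mkdir -p small_inputs
for i in 1 2 3; do echo "sample $i" > "small_inputs/part_$i.dat"; done

# Pack them into one big file before moving them onto Lustre.
tar -cf inputs.tar small_inputs/

# On a real Lustre mount you would also stripe the archive across
# several OSTs so reads and writes happen in parallel, e.g.:
#   lfs setstripe -c 4 /mnt/lustre/data
#   cp inputs.tar /mnt/lustre/data/
tar -tf inputs.tar | grep -c 'part_'   # sanity check: 3 packed files
```

Since Lustre is POSIX, the `tar` and `cp` above need nothing special; only the `lfs` striping commands are Lustre-specific.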
For completeness, Lustre also offers HSM (Hierarchical Storage Management), which works like a cache: frequently accessed data is kept on faster storage tiers, while less frequently accessed data remains on slower ones. I will not go into details, because my colleagues from Bull will surely do so in an upcoming blog post.
What job scheduler?
Big platforms require resource management: booking nodes, complex allocation, scheduling algorithms. As we do not like reinventing the wheel, we use the scheduler used by more than half of the TOP500 supercomputers: SLURM.
Free, open-source, Linux-native, highly customizable, and efficient, it is definitely a must-have in your HPC cluster. It can handle up to around 1,000 job submissions per second. A job is a shell script with annotations that SLURM interprets at submission time. The annotations are very simple, so if you are familiar with any shell scripting language you can quickly learn to write SLURM jobs. Combined with a distributed file system such as Lustre, SLURM becomes a very powerful yet simple tool, usable by almost anyone.
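To give an idea of how simple those annotations are, here is a minimal sketch of a SLURM batch script; the job name, task count, and output path are placeholder values for your own cluster:

```shell
# Write a hypothetical SLURM job script (the #SBATCH lines are the
# annotations SLURM interprets at submission time).
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=lustre-demo
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --output=results_%j.out

# The job body is ordinary shell; srun launches one instance per task.
srun hostname
EOF

# On the cluster you would submit it with:  sbatch my_job.sh
grep -c '^#SBATCH' my_job.sh   # prints 4: the annotation lines above
```

To the shell, `#SBATCH` lines are just comments, which is why any existing shell script can be turned into a SLURM job by adding a few of them at the top.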
Why a Workflow Engine?
A workflow engine adds flexibility to your platform, especially if it integrates with SLURM. That is the case with ProActive Workflows & Scheduling. This beauty brings extra features to the infrastructure: flexible multi-task ProActive workflows, task dependencies, a Web Studio for easy creation of ProActive workflows, cron-based job submission, control blocks such as replication and loops, dataspaces, job templates for managing data in cloud storage services (with support for more than 8 different file systems), and job templates for interacting with SLURM and Hadoop, among others.
There is a Node Source integrated with SLURM: in other words, when ProActive requires compute resources, they can be taken from the SLURM pool and added to ProActive Workflows & Scheduling, so that ProActive workflows execute on SLURM nodes. Once those workflows finish, the SLURM nodes are released.
ProActive also monitors the resources involved and can extend the infrastructure with private clouds (such as OpenStack and VMware) and public clouds (such as Numergy and Windows Azure).
Last but not least, ProActive provides a flexible mechanism for centralized authentication. After a one-time loading of user credentials through a simple username/password login, ProActive workflows execute without any further password requests, no matter which services the workflow invokes. Imagine your workflow accessing cloud storage accounts, executing business applications under given accounts, or changing file permissions using specific credentials for Linux accounts. All user credentials are safely stored, once, in the ProActive Third-Party Credential Store.
Any workflow can be executed via a simple REST API call, making it easy to trigger your data processing from any cloud service.
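As a sketch of what such a call could look like with `curl` (the server URL, credentials, and workflow file below are placeholders; check the REST documentation of your ProActive version for the exact endpoints):

```shell
# Placeholder URL for your ProActive Workflows & Scheduling server.
PROACTIVE_URL="https://proactive.example.com/rest"

# Step 1: authenticate once; the returned session id covers later calls.
login() {
  curl -s --data "username=$1&password=$2" "$PROACTIVE_URL/scheduler/login"
}

# Step 2: submit a workflow XML. Credentials kept in the Third-Party
# Credential Store are resolved server-side, so no extra passwords here.
submit() {
  curl -s -H "sessionid: $1" -F "file=@$2;type=application/xml" \
       "$PROACTIVE_URL/scheduler/submit"
}

# Usage against a live server (not executed here):
#   SESSION=$(login myuser mypassword)
#   submit "$SESSION" my_workflow.xml
```

Because this is plain HTTP, the same two calls can be issued from any language or cloud service that can make REST requests.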
Armadillo rocks when it comes to seismic use cases. ArmadilloDB has been optimized to work on top of the Lustre file system, with added support for correlation operations over seismic Big Data. But that is not my topic; I will let them explain it better in a separate blog post.
What do you get?
With every piece in its place, we have an interesting result: a platform that lets you bring data from outside the cluster (several cloud storage services are supported) using intelligent workflows; process it via SLURM, ProActive tasks (implementable in more than 8 languages), or external processes such as ArmadilloDB or Hadoop; and then move it to a convenient place. To give you a clearer picture of what working with a DataScale platform looks like, here is a short video (enable subtitles for a better understanding):
In this video the user runs a simple SLURM job on data available on a DataScale-powered cluster, then puts the results back on a cloud storage server. Everything is done through a command-line client that makes REST calls to the ProActive Workflow Catalog server, a module of the ProActive Workflows & Scheduling product. We will not show performance results for now; those will come in upcoming blog posts.