Friday, November 11, 2016

Optimized Algorithm Distribution

As the number of devices and sensors grows, analyzing the collected data becomes more complex. Data scientists and analysts have to deal with heterogeneous sources and leverage distributed environments to reduce analysis time. To achieve this, the ability to split a complex algorithm into smaller units of work is fundamental for taking advantage of distributed environments. This article details ActiveEon’s approach.

To distribute work efficiently with ProActive, ActiveEon has defined a task as the smallest unit of work that can be executed. A composition of tasks is called a workflow and includes additional features such as dependencies, variable propagation and shared resources. By following this principle, distribution is taken into account by design, since dependencies are explicit. The workflow designer therefore stays in control of the distribution and can optimize resource utilization and reduce execution time.
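As a rough illustration of why explicit dependencies matter, here is a generic Python sketch (not ProActive code; the workflow dictionary and the execution_waves function are invented for this example). A scheduler that knows the dependency graph can work out on its own which tasks are free to run at the same time:

# Each task maps to the set of tasks it depends on.
workflow = {
    "fetch": set(),
    "clean": {"fetch"},
    "stats": {"clean"},
    "plot": {"clean"},
    "report": {"stats", "plot"},
}

def execution_waves(dependencies):
    """Group tasks into waves; all tasks in a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(dependencies):
        ready = [t for t, deps in dependencies.items()
                 if t not in done and deps <= done]
        if not ready:
            raise ValueError("cycle detected in workflow")
        waves.append(ready)
        done.update(ready)
    return waves

print(execution_waves(workflow))
# [['fetch'], ['clean'], ['stats', 'plot'], ['report']]

Because the dependencies are declared rather than hidden inside the code, the scheduler (here, the toy execution_waves function) can decide by itself that stats and plot may run in parallel.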

Terminology

A task is the smallest unit of work.
A workflow is a composition of tasks which includes additional features such as dependencies, variable propagation and shared resources.
A job is a workflow that has been submitted to the scheduler. (Multiple jobs can be created from the same workflow.)

It is important to distinguish Workflows from Jobs. A Workflow is a directed graph of Tasks, where Tasks represent the smallest units of work that can be distributed and executed on nodes. A Workflow can be seen as a model or a template. A Job is an instance of a Workflow which has been submitted to the Scheduler and is managed by it. Jobs differ from Workflows in several ways: variable values may be updated at runtime, control structures such as loops are expanded at runtime, and so on.
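To make the distinction concrete, here is a small Python sketch (purely illustrative; the submit function and the dictionary layout are assumptions, not the ProActive API). The workflow is a reusable template, and every submission creates a separate job with its own identifier and variable values:

import copy
import itertools

job_ids = itertools.count(1)

def submit(workflow_template, variables):
    """Instantiate the template into a job; the template itself is unchanged."""
    job = copy.deepcopy(workflow_template)
    job["id"] = next(job_ids)
    job["variables"].update(variables)  # values may be updated again at runtime
    return job

template = {"name": "daily-report", "variables": {"date": None},
            "tasks": ["extract", "report"]}
job_a = submit(template, {"date": "2016-11-10"})
job_b = submit(template, {"date": "2016-11-11"})
print(job_a["id"], job_b["id"])  # 1 2 -- two jobs created from the same workflow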

Examples

A picture paints a thousand words, so here are a few examples.

In this example, dependencies are identified by the arrows. The scheduler can, for instance, easily see that the Javascript task can run in parallel with the R, Bash and Python tasks. The Groovy task, however, will be executed only once all the other tasks have completed, because of its dependencies.
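The same scheduling idea can be sketched in plain Python (conceptual only, not ProActive code; run_task and the task names are placeholders). The four independent tasks are launched concurrently, and the final task starts only once they have all returned:

from concurrent.futures import ThreadPoolExecutor

def run_task(name):
    # Stand-in for the real script executed by each branch task.
    return name + "-result"

independent = ["Javascript", "R", "Bash", "Python"]

with ThreadPoolExecutor() as pool:
    # The four branches have no dependency on each other, so they run in parallel.
    results = list(pool.map(run_task, independent))

# The equivalent of the Groovy task: it depends on every branch above,
# so it only runs once all their results are available.
print("final task consumes:", results)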

In this specific replicate structure, the Process task contains code that will be executed over a dataset (array, storage, etc.) where each unit is independent of the others. The scheduler can consequently parallelize these independent tasks. The final Merge task gathers all the results for further analysis. In the loop structure, however, the Start task depends on the Loop task, so no distribution is possible.
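The replicate/merge pattern is essentially a map followed by a reduce. Here is a minimal Python sketch of the idea (again illustrative; the process and merge functions stand in for the real task scripts):

from concurrent.futures import ProcessPoolExecutor

def process(chunk):
    # Work on one independent unit of the dataset.
    return sum(chunk)

def merge(partial_results):
    # Gather the partial results for further analysis.
    return sum(partial_results)

if __name__ == "__main__":
    dataset = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # independent units
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process, dataset))  # replicated Process tasks
    print(merge(partials))  # 45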
