Activeeon Team Blog: Error Management Best Practices in Production?

Advanced error management tools save time and money in several situations: in production, where problems must be solved without human intervention to ensure high availability; in unstable environments such as spot instances, where errors are commonplace and disruptive; and in resource-intensive workflows, which must not be run twice because of their cost.
To cope with these instabilities, ProActive offers several features for handling errors.

Functionalities and Support

The Advanced Error Management feature provides multiple options in case of failure. First, you can set the number of attempts for any given task and choose whether subsequent attempts run on a different node. Then, a policy controls the job behavior: cancel or continue the job after the last attempt, or suspend the task or the job after the first failure. These parameters can be overridden at the task level. When a task is suspended, you can mark it as finished through the scheduler portal; the scheduler then proceeds as if the task had ended without failure. Use this option when the error has been solved by an outside intervention.
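
The interplay between the number of attempts and the failure policy can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ProActive's API: `run_with_policy`, `OnError`, `flaky` and the node names are hypothetical, and in the real scheduler these settings live in the workflow definition.

```python
from enum import Enum


class OnError(Enum):
    """Hypothetical policy names mirroring the options described above."""
    CANCEL_JOB = "cancel the job after the last attempt"
    CONTINUE_JOB = "continue the job after the last attempt"
    SUSPEND_TASK = "suspend the task after the first failure"


def run_with_policy(task, nodes, max_attempts, on_error, try_elsewhere=True):
    """Run task(node) up to max_attempts times and report the outcome.

    Returns (status, attempts), where status is "finished", "suspended",
    "cancelled" or "continued".
    """
    node_index = 0
    for attempt in range(1, max_attempts + 1):
        node = nodes[node_index % len(nodes)]
        try:
            task(node)
            return "finished", attempt
        except Exception:
            if on_error is OnError.SUSPEND_TASK:
                # Stop at the first failure and wait for an operator.
                return "suspended", attempt
            if try_elsewhere:
                # The next attempt is sent to a different node.
                node_index += 1
    # All attempts failed: the policy decides what happens to the job.
    if on_error is OnError.CANCEL_JOB:
        return "cancelled", max_attempts
    return "continued", max_attempts


def flaky(node):
    """Hypothetical task that only fails on one unstable node."""
    if node == "node-a":
        raise RuntimeError("node-a is unstable")


print(run_with_policy(flaky, ["node-a", "node-b"], 2, OnError.CANCEL_JOB))
# ('finished', 2)
```

In this model, retrying on a different node turns a failure into a success on the second attempt, whereas the same call with `try_elsewhere=False` exhausts both attempts on the unstable node and cancels the job.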

This feature is fully available in the enterprise version (≥7.19), which is available on try.activeeon.com.

The Selection Script lets you select nodes that meet given requirements. For instance, it can prevent errors related to software or package availability. It also ensures that the required access rights and security settings are in place on the execution node. Finally, it makes it possible to dynamically execute tasks on nodes where selected resources (CPU, RAM, GPU) are available, even if other resources are heavily used. Selection scripts can be found under “Node Selection” in the task menu.
Samples of such scripts are available under PROACTIVE_HOME/samples/scripts/selection.
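
As an illustration of the kind of check a selection script performs, here is a minimal Python sketch that accepts a machine only if the required commands are installed and enough CPU cores are present. The function name `node_is_suitable` and the requirements are hypothetical; the convention of reporting the verdict through a variable named `selected` is an assumption here, so refer to the shipped samples for the exact form expected by the scheduler.

```python
import os
import shutil


def node_is_suitable(required_commands, min_cpus=1):
    """Return True when every required command is on the PATH and the
    machine has at least min_cpus CPU cores."""
    missing = [cmd for cmd in required_commands if shutil.which(cmd) is None]
    if missing:
        return False
    cpus = os.cpu_count() or 0  # os.cpu_count() may return None
    return cpus >= min_cpus


# Hypothetical requirements: a task that needs Python 3 and two cores.
# Assumption: the script reports its verdict through this variable.
selected = node_is_suitable(["python3"], min_cpus=2)
```

Checking for the software up front means a task is never started on a node where it is bound to fail, which saves one of its allowed attempts.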
The Console Output and Logs give details on the faulty task and its errors. In some circumstances, the logs are needed to retrieve additional information. Both are available in the scheduler portal.
The logs are also available at PROACTIVE_HOME/logs.

The Pre-script lets you prepare the node environment in which the task will be executed. For instance, it can install packages through a bash script before a Python task. A Post-script may be used to reverse these changes. However, a Clean-script is better suited to this situation since it is also executed when an error occurs in the pre-script, the post-script or the task. Even though the output is not available, information regarding the state of the environment (variables, packages, etc.) will be written in the logs.
Examples of pre-, post- and clean-scripts are available in the studio under the “Templates” menu.
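
The guarantee offered by the clean-script resembles a `try`/`finally` block: the clean-up step runs whether or not an earlier step failed. The sketch below is only an analogy in plain Python, not the scheduler's implementation, and `run_task` and `failing_task` are hypothetical names.

```python
def run_task(pre_script, task, post_script, clean_script):
    """Run the scripts in order; clean_script runs even if a step fails."""
    try:
        pre_script()
        task()
        post_script()
    finally:
        clean_script()


log = []


def failing_task():
    log.append("task")
    raise RuntimeError("task failed")


try:
    run_task(
        pre_script=lambda: log.append("pre"),
        task=failing_task,
        post_script=lambda: log.append("post"),
        clean_script=lambda: log.append("clean"),
    )
except RuntimeError:
    pass  # the failure still propagates after the clean-up has run

print(log)  # ['pre', 'task', 'clean']
```

The post-script is skipped when the task fails, which is why undoing the pre-script's changes there would leave the node dirty after an error.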
The ProActive Agent can restart failing nodes and thus ensure high availability of resources. More information is available in the administrator guide.
The ProActive support team will help you recover from more complex issues. They will reproduce the errors and send you recommendations that fit your environment and constraints.
All of these can help you create a workflow adapted to your error management needs. You should also follow the advice below to ensure high availability and sound behaviour.

Best practices

When writing a workflow, the error handling policy should be part of the design process. Consider the following questions:

How critical is the workflow: can you pause it for hours?
How long does it take: can you afford to execute it multiple times?
How stable is the environment: will connections drop during the job and cause errors?
How skilled are your developers: how likely are errors?
How skilled is your maintenance team: what can they do in case of error?
How many workflows do you run per unit of time?

After answering these questions, select the error handling policy which fits the global needs and then update the parameters for individual tasks requiring different behaviour.
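
One way to picture this layering is a set of job-wide defaults that individual tasks override only where they differ. The sketch below is a plain Python model, not ProActive configuration syntax; all keys, values and task names are hypothetical.

```python
# Hypothetical job-wide policy that fits most tasks.
JOB_DEFAULTS = {
    "max_attempts": 2,
    "on_error": "cancel_job",
    "retry_elsewhere": False,
}

# Only the tasks with different needs declare an override.
TASK_OVERRIDES = {
    # Expensive task: never rerun automatically, wait for an operator.
    "train_model": {"max_attempts": 1, "on_error": "suspend_task"},
    # Network-bound task: cheap to retry, preferably on another node.
    "download_data": {"max_attempts": 5, "retry_elsewhere": True},
}


def effective_settings(task_name):
    """Merge the job defaults with the overrides of one task, if any."""
    return {**JOB_DEFAULTS, **TASK_OVERRIDES.get(task_name, {})}


print(effective_settings("download_data"))
# {'max_attempts': 5, 'on_error': 'cancel_job', 'retry_elsewhere': True}
```

Keeping the overrides small makes it easy to see at a glance which tasks deviate from the global policy and why.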
When using tasks with special requirements (OS, packages, etc.), make sure to use a selection script to target the right nodes.
When packages are installed, usually in the pre-script, do not forget to uninstall them in the clean-script of the same task. This will leave the node in its pre-task state. Alternatively, Docker containers can be used to avoid those issues.

Thursday, October 13, 2016
