Wednesday, October 26, 2016

Reliable Execution in each Environment with Docker and ProActive

Docker containers make code independent of the underlying environment. Being heavily multi-platform, ProActive offers several ways to manage dependencies, including containers, which this article covers.

How to start a container in ProActive

In ProActive, launching a task within a container takes three simple steps:

  • go to the “Fork Environment” tab,
  • in the Java Home field, enter the path to the Java installation directory on the node,
  • finally, in the environment script field, set the “preJavaHomeCmd” variable to the docker command that starts the appropriate container. For instance, preJavaHomeCmd = ‘docker run -p 80:8080 --rm java’.
After that, you’re done.

Let’s start with an example

A picture paints a thousand words. The example below takes the workflow from a previous article and containerizes it. You can follow these instructions on try.activeeon.com or after installing ProActive on your own server.

Original workflow and container

This workflow is used as a base. It requires R and several specific packages that are not available on every node. The container will bundle all those dependencies.

First, create a docker image with the following software installed:

  • R,
  • Java 8,
  • rJava,
  • libcurl-dev,
  • libssl-dev.
In the Dockerfile, add the instruction “RUN R CMD javareconf -e” for configuration purposes, then install the required R packages:
  • stringr,
  • tm,
  • wordcloud,
  • twitteR,
  • textreg,
  • mailR.
The image must be available on every node or on Docker Hub for download.
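
One way to confirm that an image really contains the R packages listed above is to run a small check inside it. The sketch below is an illustration, not part of the original workflow: `missing_packages` is a hypothetical helper name, and the commented-out `install.packages` call assumes a CRAN mirror is reachable from wherever the script runs.

```r
# Packages the workflow's R tasks rely on (taken from the list above).
required <- c("stringr", "tm", "wordcloud", "twitteR", "textreg", "mailR")

# Hypothetical helper: return the required packages that are not installed.
# `installed` defaults to the packages visible in the current library paths.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Assumption: uncomment only where a CRAN mirror is reachable.
  # install.packages(to_install, repos = "https://cloud.r-project.org")
} else {
  message("All required packages are installed.")
}
```

Running such a script during the image build, or once in a started container, would surface a missing package before a workflow task fails on it.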

In the “Fork Environment” tab of every R task:

  1. choose Docker as the Fork Execution Environment; this will fill the fields according to a template,
  2. choose “User Defined” to be able to update the fields,
  3. in the “Environment Script”, change the “containerName” variable to the name of the image to use. (Watch out: despite the variable’s name, it expects the image name, not the container name.)

Here is the link to a Docker Hub image with the required packages.

Let’s remove the display from our original workflow

The current workflow displays the wordcloud on screen. This is no longer possible, since the R task now runs within a container.

In the replicated task, find:

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, Locations$cities[variables[["PA_TASK_REPLICATION"]]+1], col= "blue")
wordcloud(corpus, scale=c(5,2),rot.per = 0.25, random.color=T, max.word=45, random.order=F, colors=col, main="Title")

#save to file in dataspace
dev.copy(png,paste(Locations$cities[variables[["PA_TASK_REPLICATION"]]+1], '.png', sep=""))
dev.off()

then replace it with the code below:
# open the PNG file device directly, so no on-screen device is needed
png(paste(Locations$cities[variables[["PA_TASK_REPLICATION"]]+1], ".png", sep = ""))
plot.new()
wordcloud(corpus, scale=c(5,2), rot.per = 0.25, random.color=T, max.words=45, random.order=F, colors=col)
# close the device to write the file
dev.off()
Mirror this change in the final task.
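
The replacement works because it opens the PNG file device first and draws straight into it, instead of copying from a device that is already open. A minimal sketch of that pattern with base graphics only is given below; the output path, the `draw_to_png` helper name, and the `capabilities("png")` guard are illustrative assumptions, and `wordcloud` is swapped for a plain `text` call so the snippet runs without extra packages.

```r
# Illustrative output name; the workflow builds it from the city name instead.
out_file <- file.path(tempdir(), "example.png")

# Hypothetical helper: draw into a PNG file without any on-screen device.
draw_to_png <- function(path) {
  png(path)             # open the file device first
  on.exit(dev.off())    # always close the device, even on error
  plot.new()
  text(x = 0.5, y = 0.5, "rendered without a display", col = "blue")
  invisible(path)
}

# Assumption: the R build in the image supports the png device.
if (capabilities("png")) {
  draw_to_png(out_file)
  print(file.exists(out_file))
}
```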

The results are available in the dataspace. When using try.activeeon.com, the dataspace can be accessed here (same login and password as the try platform). On a local server, the dataspace is located in PROACTIVE_HOME/data/defaultuser/USER.

If the dataspace is inaccessible, two courses of action are possible: sending the images by email or downloading them.

Send wordclouds through Email

To send them by email, add a workflow variable “Email” holding the recipient’s address, then add this code to the replicated task (and its equivalent to the last task):

library(mailR)
# sender address and recipient taken from the “Email” workflow variable
sender <- "YourMailAdress"
recipients <- c(toString(variables[["Email"]]))
email <- send.mail(from = sender,
to = recipients,
subject = "ProActiveInContainer",
body = "Partial Analysis",
smtp = list(host.name = "smtp.gmail.com", user.name = "YourMailAdress", passwd = "YourPassword", ssl = TRUE),
authenticate = TRUE,
attach.files = c(paste(localspace, "/", Locations$cities[variables[["PA_TASK_REPLICATION"]]+1], ".png", sep = "")),
file.names = c(paste(Locations$cities[variables[["PA_TASK_REPLICATION"]]+1], ".png", sep = "")),
send = TRUE)
print('File sent')

If Gmail is not the sender’s email provider, the “smtp” argument must be modified accordingly.
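
As an illustration of what such a modification might look like, the sketch below builds an “smtp” list for a provider that uses STARTTLS. The host name, port, and credentials are placeholders, not values from the original workflow; `host.name`, `port`, `user.name`, `passwd`, `ssl`, and `tls` are the fields mailR documents for this argument.

```r
# Placeholder settings for a non-Gmail provider; replace with real values.
smtp_settings <- list(
  host.name = "smtp.example.com",  # hypothetical SMTP host
  port      = 587,                 # common STARTTLS submission port
  user.name = "YourMailAdress",
  passwd    = "YourPassword",
  tls       = TRUE                 # tls instead of ssl for STARTTLS
)

# Inspect the settings before passing them as send.mail(..., smtp = smtp_settings)
str(smtp_settings)
```

Keeping the settings in a named list also makes it easy to reuse them in both the replicated task and the last task.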

Download wordcloud through ProActive Scheduler

The “result” variable of a task can easily be downloaded through the scheduler: the “Preview” tab offers a file containing the content of the result. An image can be handled the same way when it is stored in the variable as binary data.

To achieve this, create a new task that puts the content of the image in its “result” variable, which can then be downloaded.
Create this task after the “Merge” one and add the userspace as input files (“Data Management” tab). Set the task’s language to Groovy and add the code below:

fileName = "Main.png"
file = new File(fileName)
result = file.getBytes()
resultMetadata.put("file.name", fileName)
resultMetadata.put("content.type", "image/png")

To retrieve the image after running the job, go to the scheduler, open the “Preview” tab, select your new task and click on “Open in browser” or “Save as file”.

For advanced users

To do the same for all the images, the downloading task needs to be replicated too. Multiple tasks can be replicated at once, but they must be in a ‘block’ starting at the target of the ‘replicate’ control. This requires a few changes to the workflow to comply with the rules of this new structure.

"Task blocks are very similar to the parenthesis of most programming languages: anonymous and nested start/end tags. The only difference is that a parenthesis is a syntactical information, whereas task blocks are semantic."

Here is the way to write it:

  • create an additional Groovy task (called Images in this picture). It will pass the content of the images in the “result” variable, as seen above,
  • create an additional R task (called Additional in this picture) to keep the integrity of the workflow, as required by the ‘block’ structure,
  • create the dependencies so that the workflow looks like the one shown on the left (names may differ). Be sure to create the dependency from “Analysis” to “Additional” before the one from “Images” to “Additional”, so that Analysis’ result comes first: results are aggregated in the order declared in the dependency list, which in the Studio is the order in which the dependencies are created,
  • using the tasks’ “Control Flow” tab, set “Analysis” as the start of a block and “Additional” as its end, so that the 3 tasks will be replicated,
  • at the end of the “Split” task, add the line “variables[['names']] <- cities” to be able to access it from any task.
To the R task called Additional, add the code below to give the right “result” to “Merge”:
result <- results[[1]]
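
To see why the dependency order matters here, the sketch below imitates the situation outside ProActive with a hand-built `results` list. The two stand-in values are assumptions for illustration only; inside a real task, `results` is supplied by the scheduler.

```r
# Stand-ins for the parents' results, in dependency-creation order:
# "Analysis" first, then "Images" (raw bytes of a PNG file).
results <- list(
  "analysis output",
  as.raw(c(0x89, 0x50, 0x4e, 0x47))
)

# Forward only the Analysis result to "Merge".
result <- results[[1]]
stopifnot(is.character(result))
```

Had the dependencies been created in the opposite order, the first element would hold the image bytes instead and the check would fail.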
To the Groovy task called Images, add the userspace as input files and add the code below:
replication = variables.get('PA_TASK_REPLICATION')
fileName = variables.get("names")[replication] + ".png"
file = new File(fileName)

result = file.getBytes()
resultMetadata.put("file.name", fileName)
resultMetadata.put("content.type", "image/png")


The workflow is ready.
