Thursday, September 29, 2016

How to accelerate your R algorithm with ProActive

When analyzing data sets, getting “instant” insight is often critical for business decisions, which is why one may want to dedicate all available resources to the set of tasks delivering that analysis to stakeholders. Here is an example of a Twitter feed analysis using R (a statistics-oriented language) and ProActive: it gathers messages, then performs the analysis distributed across distinct computing nodes.

How it works

In selected cities, this workflow searches Twitter feeds for a chosen set of keywords and returns a chosen number of tweets. (Due to Twitter's rate limiting, the number of cities cannot exceed 15: “Rate limits are divided into 15 minute intervals [...] 15 calls every 15 minutes.”, https://dev.twitter.com/rest/public/rate-limiting)

ProActive first creates as many tasks as there are cities, then runs them in parallel on ProActive nodes. In these R-based tasks, each node will perform the following steps (sketched in code after the list):

  • connect to Twitter,
  • perform the desired search,
  • analyze the information,
  • store the result in the ProActive dataspace (a shared storage environment adapted to distributed systems).
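As a rough illustration, a single per-city task could look like the R sketch below. It assumes the twitteR package for accessing the Twitter REST API, and that the workflow injects the task variables city_name, city_geocode and localspace (the path to the dataspace folder); those names, the keyword and the credentials are illustrative, not part of the actual workflow.

  # One per-city task: fetch tweets around a city and store word counts.
  # city_name, city_geocode and localspace are hypothetical task variables
  # assumed to be injected by the ProActive workflow.
  library(twitteR)

  # Authenticate against the Twitter REST API (placeholder credentials).
  setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET",
                      "ACCESS_TOKEN", "ACCESS_SECRET")

  # Search for the keyword near the city; geocode is 'lat,long,radius'.
  tweets <- searchTwitter("bigdata", n = 25, geocode = city_geocode)
  df     <- twListToDF(tweets)

  # Simple analysis: count how often each word occurs in the messages.
  words  <- unlist(strsplit(tolower(df$text), "[^a-z]+"))
  freq   <- table(words)
  freq   <- freq[order(freq, decreasing = TRUE)]
  result <- data.frame(word = names(freq), count = as.integer(freq))

  # Store the partial result in the ProActive dataspace for the final task.
  write.csv(result,
            file.path(localspace, paste0(city_name, "_result.csv")),
            row.names = FALSE)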
A final task then merges and analyzes all of the partial results before storing the outcome in the same dataspace.
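Under the same assumptions, that final task could merge the per-city files along these lines:

  # Final task: merge all per-city word counts into one global ranking.
  files  <- list.files(localspace, pattern = "_result\\.csv$", full.names = TRUE)
  parts  <- lapply(files, read.csv, stringsAsFactors = FALSE)
  merged <- do.call(rbind, parts)

  # Sum the counts of identical words across cities and rank them.
  total <- aggregate(count ~ word, data = merged, FUN = sum)
  total <- total[order(-total$count), ]

  # Store the global result back in the dataspace.
  write.csv(total, file.path(localspace, "global_result.csv"), row.names = FALSE)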


Let's code