Tuesday, July 21, 2015

Portfolio risk diversification with Spark & ProActive Workflows

Authors: Michael Benguigui  and Iyad Alshabani

In order to design a financial tool to diversificate risks, we describe the following specific spark jobs orchestrated by ProActive Scheduler. Since we need to process huge amounts of asset prices and achieve streaming k-means to keep models up to date, Spark (http://spark.apache.org/) is definitely the right technology. Firstly, MLlib (Spark’s scalable machine learning library https://spark.apache.org/mllib/) offers functionalities for streaming data mining. Secondly, Spark allows to work with advanced map/reduce instructions on RDDs (Resilient Distributed Dataset), i.e. partitioned collections of elements that can be operated in parallel. All these jobs required a powerful framework to be deployed, orchestrated and monitored, as provided by ProActive.


Spark streaming jobs 


Spark Cluster

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers to allocate resources across applications: Spark’s own standalone cluster manager, Mesos (http://mesos.apache.org/), YARN (http://fr.hortonworks.com/hadoop/yarn/). Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.

ProActive Workflow, Scheduler & Resource Manager

Spark jobs are created and deployed using Proactive Workflows, offering task templatization to facilitate workflow elaboration. Indeed, Proactive Workflow & Scheduling is an integrated solution for deploying, executing and managing complex workflows in the cloud and on premise as well. All jobs are orchestrated by the ProActive Scheduler and the ProActive Resource Manager virtualizes and monitors  resources.
The ProActive Scheduler and Resource Manager are deployed in a way that monitors each Spark Cluster which, in its turn, monitors its applications. The ProActive scheduler is used as a meta scheduler of the set of Spark clusters. Indeed, each Spark cluster has its own Spark scheduler. ProActive then orchestrates the set of clusters and jobs as shown with the next figure.

Architecture of our application dedicated to diversificate risks

Spark Streaming jobs

We focus on the CAC40 stocks correlations analysis over short periods. Indeed, our spark streaming jobs require a batch duration parameter to be set. This parameter controls the time period in seconds, during which the job stream input data, before processing. The main Spark jobs of our financial tool are described as follows:
  • The first streaming job queries data from Yahoo! or Google web services (user choice), cleans data for the following jobs, and writes data in a shared directory. The batch duration is set by the user, and only successive distinct quotes are stored. For instance, considering a 10s batch duration, quotes will be queried during 10s before being processed and written in a shared directory.
  • A second Spark streaming job keeps up to date the correlation matrix coefficients (“Spearman” correlations in MLlib, against the commonly used “Pearson” correlations not applicable, if we consider the price processes are lognormally distributed as the Black & Scholes model states) by streaming the processed stock quotes from the previous job.
  • A specific Scala job is in charge of drawing a heatmap from a matrix. Applied to the estimated correlations, it depicts the constantly updated correlation coefficients with associated colors.

CAC40 correlation heatmap

  • Relying on KMeans algorithm (MLlib), another Spark streaming job clusterizes correlations to build a well-diversified portfolio. KMeans clustering aims to partition a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean.
  • Here, the drawing job is used for the clusterized correlations representation: companies with similar correlations (considering as many features/correlations as stocks, i.e. 40) are assigned to the same cluster.

CAC40 clusterized correlation heatmap

We could use resulting correlations to simulate the underlying stocks of an option, and moreover through Monte Carlo simulations, the delta greek to delta hedge it (i.e. find positions on the market to reduce the exposition of the underlying stocks variations). For sure, all embarrassingly parallel jobs can benefit from the mapReduce paradigm offered by Spark. In the automated trading context, it could be interesting to let ProActive automatically deploy such application involving specialized streaming spark jobs, and let ProActive dynamically orchestrate them: some streaming jobs would be specific to trading strategies, and other streaming jobs would be in charge of market data analysis. Proactive Workflow & Scheduling afford such dynamic deployment and orchestration, in a fully scalable way.