Thursday, December 8, 2016

Big Data, Languages and Software Solutions

Inspired by an article from ThinkR

Big Data is a common term that everyone uses without a clear definition of what it really is. It appears everywhere and in every industry, from transportation to human resources to healthcare, which makes it all the harder to pin down. For this article, let’s agree that Big Data is data that cannot be handled by a single computer, even though computers with 1 TB of RAM are available these days. In increasingly common cases, all the data for a computation might not even be in one place, which requires advanced distributed algorithms and software.

Big Data According to Hadley Wickham

Hadley Wickham, a well-known developer for the R language whose libraries are used in most R projects, defined three main categories of Big Data:

  • A category where the analysis “only” consists of selecting a sample from the large dataset, usually stored in Hive, Teradata or a similar system, and running the analysis on that sample (see the sampling sketch after this list). It represents 90% of cases.
  • A category where the Big Data can be split into many Small Data problems, each processed independently (see the parallel sketch after this list), which represents 9% of cases. This category is ideal for parallel and distributed software.
  • A category for real Big Data needs, which requires specifically sized machines that are usually difficult to access, whether in the cloud or in a datacenter. It represents the remaining 1% of cases.
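
To make the first two categories concrete, here are two minimal R sketches. The first illustrates category 1: pull a small sample out of the large store and analyse it locally. The DSN name ("hive_dsn"), the table name ("events") and the sampling clause are assumptions for a Hive-style backend.

    # Category 1: sample from the large store, analyse the sample locally.
    library(DBI)

    con <- dbConnect(odbc::odbc(), dsn = "hive_dsn")   # hypothetical Hive DSN
    sample_df <- dbGetQuery(con, "
      SELECT * FROM events TABLESAMPLE(1 PERCENT)      -- Hive block sampling
    ")
    summary(sample_df)   # the real analysis only ever sees the small sample
    dbDisconnect(con)

The second illustrates category 2: the large dataset is assumed to be pre-split into chunk files that each fit in memory, and the same small computation runs on every chunk in parallel.

    # Category 2: split Big Data into Small Data, process the chunks in parallel.
    library(parallel)

    chunk_files <- list.files("data/chunks", full.names = TRUE)  # hypothetical path
    fit_one <- function(path) {
      chunk <- read.csv(path)
      coef(lm(y ~ x, data = chunk))   # same small model on every chunk
    }
    # mclapply forks, so mc.cores > 1 requires a Unix-alike system
    results <- mclapply(chunk_files, fit_one, mc.cores = detectCores())
    do.call(rbind, results)           # combine the per-chunk results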

Solutions

Language and Algorithms

In this Big Data ecosystem, multiple languages and tools have been developed to meet these challenges.

This article was inspired by an article written by ThinkR about how the R language is suited to in-memory Big Data challenges. Working with experts like ThinkR makes it possible to write more efficient R code and extract more valuable information.
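
As an illustration of this in-memory approach, here is a minimal sketch using the data.table package, a common choice for fast single-machine analytics in R; the file and column names are hypothetical.

    # In-memory analytics in R: fast read, then grouped aggregation, all in RAM.
    library(data.table)

    dt <- fread("transactions.csv")             # hypothetical file; multi-threaded read
    dt[, .(total = sum(amount)), by = region]   # aggregate without leaving memory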

However, depending on the situation, other tools may be better suited. For instance, Hadoop is adapted to Big Data challenges involving thousands of computers.
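
For that distributed case, the classic illustration is a MapReduce word count. The sketch below uses the rmr2 package from the RHadoop project and assumes a configured Hadoop cluster; the HDFS input path is hypothetical.

    # Word count on Hadoop, driven from R via rmr2 (a hedged sketch).
    library(rmr2)

    wc <- mapreduce(
      input        = "/user/demo/docs",         # hypothetical HDFS path
      input.format = "text",                    # values arrive as lines of text
      map = function(k, lines) {
        words <- unlist(strsplit(lines, "\\s+"))
        keyval(words, rep(1L, length(words)))   # emit one (word, 1) pair per word
      },
      reduce = function(word, counts) keyval(word, sum(counts))
    )
    from.dfs(wc)   # pull the (word, count) pairs back into the R session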

Software support

To support these languages, other tools have emerged to answer the challenges raised by the different categories of Big Data.

  • In the category covering 9% of Big Data analyses, distribution is also part of the challenge. Efficiently distributing workloads onto an infrastructure while optimizing costs and time is critical to get insights on time and on budget.
  • In the final category mentioned above, the main challenge is to automate the provisioning and selection of the right machine. It may also be necessary to lock it away from other users or to add priorities, to ensure it is available when the algorithm requires it.
  • In every category, resources can be costly or limited, so containing and managing errors is a requirement. The most predictable errors should be resolved automatically (see the retry sketch after this list), to ensure the resources consumed (time, CPU, RAM, etc.) are not wasted.
  • Cloud bursting capabilities are required for environments where time is critical, e.g. investment bankers need to take positions ahead of the market, and insurance companies need to be aware of their risk exposure for regulatory purposes and to reduce risks affecting profits.
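
On the error-management point, here is a minimal R sketch of the idea: wrap a predictable, flaky step in a retry loop so that the compute already spent is not lost to a transient failure. The run_step function is hypothetical; a scheduler applies the same policy at platform level, across machines.

    # Retry a step a few times before giving up, to avoid wasting resources.
    with_retry <- function(fn, attempts = 3, wait = 5) {
      for (i in seq_len(attempts)) {
        result <- tryCatch(fn(), error = function(e) e)
        if (!inherits(result, "error")) return(result)  # success: stop retrying
        message(sprintf("attempt %d failed: %s", i, conditionMessage(result)))
        Sys.sleep(wait)   # back off before the next attempt
      }
      stop("all attempts failed")
    }

    # Usage, assuming run_step is an occasionally failing computation:
    # result <- with_retry(function() run_step())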

At the scale of Big Data today, resource optimization becomes as important as algorithm optimization. Algorithms are optimized for fast analytics, in-depth data analysis, visualization, etc., and software ensures those analytics are delivered on time and on budget while respecting IT policies and security requirements.

ActiveEon

ProActive from ActiveEon offers multiple features to answer these software needs, from workload distribution and automated provisioning to error management and cloud bursting.

To give a real use case: at INRA, ActiveEon's ProActive was used with R and Hadoop for a quantitative analysis of the human microbiome.

All features are available and already installed on try.activeeon.com for anyone to try. For further information, do not hesitate to reach out for a demo or to go through some case studies (contact@activeeon.com).
