Inspired by an article from ThinkR
Big Data is a common term that everyone uses without a clear definition of what it actually is. It appears everywhere and in every industry, from transportation to human resources to healthcare, which makes it all the harder to pin down. For this article, let's agree that Big Data is data that cannot be handled on a single computer, even though machines with 1 TB of RAM are available these days. In an increasingly common scenario, all the data for a computation might not even be in one place, which requires advanced distributed algorithms and software.
Big Data According to Hadley Wickham
Hadley Wickham, a well-known developer for the R language who has built libraries used in most R projects, defined 3 main categories of Big Data:
- A category where the analysis “only” consists of selecting a sample from the large dataset, usually stored in Hive, Teradata or a similar system, and running the analysis on that sample. It represents 90% of cases.
- A category where the Big Data can be split into Small Data, which represents 9% of cases. This category is ideal for parallel distribution software.
- A category for real Big Data needs, which requires specifically sized machines that are usually difficult to access, whether in the cloud or in a datacenter.
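The first category above (sample, then analyze) can be sketched in a few lines. The sketch below is a hypothetical Python illustration, not taken from the article: it uses reservoir sampling so that only the sample, never the full dataset, sits in memory; the record format and sample size are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream
    without ever holding the full stream in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with decreasing probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Simulate a dataset far too large to analyze whole.
big_stream = (f"record-{i}" for i in range(1_000_000))
sample = reservoir_sample(big_stream, k=1000)
print(len(sample))  # the actual analysis then runs on the 1,000-record sample
```

In practice the sampling would often happen inside the data store itself (e.g. a `TABLESAMPLE` query), but the principle is the same: shrink the data first, analyze second.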
Language and Algorithms
In this Big Data ecosystem, multiple languages and tools have been developed to support these categories.
As the ThinkR article that inspired this one points out, the R language is well suited to in-memory Big Data challenges. Working with experts like ThinkR helps make R-based analyses more efficient and extract more valuable information.
However, depending on the situation, other tools can be better suited. For instance, Hadoop is adapted to Big Data challenges involving thousands of computers.
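Hadoop's distribution model rests on the MapReduce pattern: map each chunk of data to key–value pairs, shuffle the pairs by key, then reduce each group to a result. As a toy, single-machine Python sketch of that pattern (a word count over hypothetical data, not real Hadoop code):

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in every record of the chunk.
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    # Shuffle: group emitted values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

# Two "nodes" each process their own chunk; results are merged afterwards.
chunks = [["big data big", "data tools"], ["big challenges"]]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'tools': 1, 'challenges': 1}
```

In real Hadoop the map and reduce steps run on different machines and the shuffle moves data across the network, which is exactly what makes the second Big Data category (split into Small Data) tractable.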
To support these languages, other tools have emerged to answer challenges raised by the different categories of Big Data.
- In the category covering 9% of Big Data analyses, distribution is part of the challenge. Efficiently distributing workloads onto an infrastructure while optimizing cost and time is critical to getting insights on time and on budget.
- In the final category mentioned above, the main challenge is to automate the provisioning and the selection of the right machine. It may also be necessary to lock it away from other users or to add priorities, to ensure it is available when the algorithm requires it.
- In every category, resources can be costly or limited, so containing and managing errors is a requirement. The most predictable errors should be resolved automatically, ensuring the resources already consumed (time, CPU, RAM, etc.) are not wasted.
- Cloud bursting is required in environments where time is critical. For example, investment bankers need to take positions ahead of the market, and insurance companies need to know their risk exposure, both for regulatory purposes and to reduce risks affecting profits.
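The point about automatically resolving predictable errors can be sketched with a generic retry-with-backoff wrapper. This is a hypothetical Python illustration of the idea, not ProActive's actual mechanism (which is configured in the platform itself); the task and error types are assumptions.

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=0.1):
    """Re-run a task on predictable, transient failures so the resources
    already consumed are not wasted on a manual restart."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # unpredictable or persistent failure: surface it
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical flaky task: fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "result"

print(run_with_retries(flaky, base_delay=0.0))  # prints "result"
```

The key design point is distinguishing predictable, transient errors (retried automatically) from everything else (raised immediately), so compute time is only re-spent when a retry has a realistic chance of succeeding.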
At the scale of Big Data today, resource optimization becomes as important as algorithm optimization. Algorithms are optimized for fast analytics, in-depth data analysis, visualization, etc., and software ensures those analytics are delivered on time and on budget while respecting IT policies and security requirements.
ProActive from ActiveEon offers multiple features to answer these software needs such as:
- Cloud bursting on AWS, Azure, SoftLayer, etc.
- Resource management (public, hybrid, private, multi clouds, VMs, containers)
- Advanced error management
- User-friendly UI for efficient distribution onto any infrastructure
- Customizable selection scripts to target the right resource
All features are available and already installed on try.activeeon.com for anyone to try. For further information, do not hesitate to reach out to us for a demo and some case studies (email@example.com).