Improving the Performance and Agility of Big Data in a Containerized Environment

Published: 11 May 2017 | Last Updated: 11 May 2017

Analyzing large datasets with Apache* Hadoop*, Apache* Spark*, and other big data frameworks gives enterprises deeper insight and greater business value. Because of the sheer size of the data, executing processing jobs on large compute clusters can in some cases take hours. Compute resources are expensive, and the cost per job is inversely proportional to throughput, so performance is paramount.
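The cost-throughput relationship can be made concrete with a toy calculation. The hourly cluster rate and job counts below are hypothetical figures for illustration, not numbers from this article:

```python
# Illustrative arithmetic: on a cluster with a fixed hourly cost,
# the cost of each job is inversely proportional to throughput.
CLUSTER_COST_PER_HOUR = 120.0  # hypothetical cluster cost in $/hour


def cost_per_job(jobs_per_hour: float) -> float:
    """Cost of one job when the cluster completes `jobs_per_hour` jobs per hour."""
    return CLUSTER_COST_PER_HOUR / jobs_per_hour


# Doubling throughput halves the per-job cost:
print(cost_per_job(4))  # 30.0 ($ per job at 4 jobs/hour)
print(cost_per_job(8))  # 15.0 ($ per job at 8 jobs/hour)
```

This is why any virtualization or containerization overhead that lowers throughput translates directly into higher cost per job.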

To ensure the best performance, most companies deploy in-house big data analytics on bare-metal servers. To date, many IT departments remain reluctant to run big data on virtual machines or containers, largely because of the processing overhead and input/output (I/O) latency associated with virtualization and containerization.

As a result, most in-house big data initiatives suffer from limited agility. Deployment on traditional bare-metal installations typically takes weeks to months, which slows the adoption of Hadoop, Spark, and other big data deployments. And because most cloud services run on virtual machines, public clouds may degrade performance. Nonetheless, a growing number of analysts and data scientists use public clouds for big data in order to gain agility.

Intel and BlueData* collaboration

About a year and a half ago, Intel announced an investment and collaboration agreement with BlueData aimed at solving these problems. BlueData's EPIC* software platform leverages the inherent deployment flexibility of Docker* containers to accelerate big data deployments. Container-based clusters on the BlueData platform look and behave much like standard physical clusters in bare-metal deployments, require no changes to Hadoop or other big data frameworks, and can be implemented on-premises, in a public cloud, or in a hybrid cloud architecture.

With BlueData, organizations can quickly and easily deploy big data (flexible, on-demand, self-service Hadoop or Spark clusters delivering a big-data-as-a-service experience) while reducing costs. The BlueData platform is tuned to the performance requirements of big data. For example, BlueData uses hierarchical data caching and tiering to enhance the I/O performance and scalability of container-based clusters. It also lets multiple user groups securely share the same cluster resources, so each group no longer needs dedicated big data infrastructure, greatly reducing complexity.
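The general idea behind tiered data caching can be sketched generically: keep hot data in a small, fast tier and spill least-recently-used data to a larger, slower tier, promoting it back on access. The toy model below illustrates only the concept; it is not BlueData's implementation:

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache: a small fast tier (think RAM) backed by a
    larger slow tier (think local disk). Illustrative sketch only."""

    def __init__(self, fast_capacity: int):
        self.fast = OrderedDict()  # hot tier, LRU-ordered
        self.slow = {}             # cold tier
        self.fast_capacity = fast_capacity

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        if len(self.fast) > self.fast_capacity:
            # Demote the least recently used block to the slow tier.
            old_key, old_val = self.fast.popitem(last=False)
            self.slow[old_key] = old_val

    def get(self, key):
        if key in self.fast:       # fast-tier hit
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:       # slow-tier hit: promote back to fast tier
            value = self.slow.pop(key)
            self.put(key, value)
            return value
        return None                # miss


cache = TieredCache(fast_capacity=2)
cache.put("block1", b"...")
cache.put("block2", b"...")
cache.put("block3", b"...")              # demotes block1 to the slow tier
print("block1" in cache.fast)            # False
print("block1" in cache.slow)            # True
print(cache.get("block1") is not None)   # True: block1 is promoted back
```

A production system layers real storage media (memory, SSD, HDD, remote storage) behind the same promote/demote logic, which is what keeps container-based clusters' I/O competitive.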

To ensure both the agility and the high performance of big data deployments, Intel tested and tuned the BlueData EPIC platform, helping enhance its performance as part of the strategic collaboration with BlueData. We worked closely with BlueData to demonstrate, through proven and quantifiable performance metrics, that BlueData's software innovations deliver performance for big data workloads such as Hadoop and Spark comparable to bare-metal deployments.

Performance evaluation results

Intel ran the same big data workloads in both BlueData and bare-metal environments and compared the performance differences using standard metrics. The latest tests were performed with the BigBench benchmark suite on identically configured Intel® Xeon® processor-based systems to ensure an apples-to-apples comparison.

These in-depth studies showed that the performance of container-based Hadoop workloads on BlueData EPIC is equal to (and in some cases slightly better than) bare-metal Hadoop. For example, a performance evaluation using 50 Hadoop compute nodes and 10 TB of data found an average performance gain of 2.33% on the BlueData EPIC platform relative to bare metal. This is a milestone in the ongoing collaboration between Intel and the BlueData software engineering teams.
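An average gain figure like this is typically computed as the mean of per-query improvements across the benchmark. The runtimes below are made-up numbers to show the arithmetic; the article reports only the 2.33% average, not per-query results:

```python
# Hypothetical per-query runtimes in seconds for a BigBench-style
# comparison (illustrative values, NOT actual measurements).
bare_metal = [120.0, 300.0, 95.0, 210.0]   # bare-metal runtimes
container  = [117.0, 294.0, 93.5, 205.0]   # containerized runtimes

# Per-query gain: how much faster the containerized run was, in percent.
gains = [(bm - c) / bm * 100 for bm, c in zip(bare_metal, container)]
avg_gain = sum(gains) / len(gains)
print(f"average gain: {avg_gain:.2f}%")
```

A positive average means the containerized platform completed the workload faster on average than bare metal.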

This means companies no longer need to choose between performance and agility; they can have both, ensuring the performance and the agility of big data analytics in on-premises deployments. With BlueData EPIC software and Intel Xeon processors, Docker containers provide flexibility and cost efficiency while delivering performance on par with bare metal. Data science teams gain the advantage of on-demand access to big data environments while benefiting from enterprise-grade data management and security in a multi-tenant architecture. As a result, BlueData EPIC software running on Intel architecture has become the preferred solution stack for many big data initiatives.

For more such Intel IoT resources and tools from Intel, please visit the Intel® Developer Zone.

Source: https://software.intel.com/zh-cn/blogs/2017/03/14/performance-and-agility-with-big-data