Modernizing Software with Future-Proof Code Optimizations

Published: 14 Sep 2017 | Last updated: 14 Sep 2017

Create High Performance, Scalable and Portable Parallel Code with New Intel® Parallel Studio XE 2018

Intel® Parallel Studio XE is our flagship product for software development, debugging, and tuning on Intel processor architectures for HPC, enterprise, and cloud computing. It is a comprehensive tool suite that contains everything from compilers and high-performance math libraries all the way to debuggers and profilers for large-scale cluster applications. These tools enable developers to exploit the full performance potential of Intel® processors. Intel Parallel Studio XE is designed to help developers create high performance, scalable, reliable parallel code—faster.

The latest release, Intel Parallel Studio XE 2018, contains many new and interesting features [1]. Let’s start with parallelism. It’s in the product name, after all. Software development and parallelism used to be separate concerns, and parallel computing was mainly confined to high-performance computing practitioners. Today, however, parallel architectures are ubiquitous. Multicore processors are now found in everything from handheld devices to the world’s most powerful supercomputers.

The Intel® Compilers support the OpenMP* 4.5 standard for compiler-directed multithreading, plus initial support for the 5.0 draft.

OpenMP is now 20 years old and continues to evolve with new hardware architectures [2, 3, 4]. The latest versions provide computation offload to accelerator devices, vectorization directives, enhanced control of thread placement, and much more [5]. For distributed-memory process-level parallelism, the Intel® MPI Library supports the latest Message Passing Interface (MPI) standard. It includes many optimizations for collective communication and for job startup and shutdown, as well as support for the latest high-speed interconnects such as the Intel® Omni-Path Architecture (Intel® OPA). Combining OpenMP and MPI in the same application has proven to be a powerful way to achieve scalable parallelism on modern clusters.
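As a small illustration of the compiler-directed multithreading described above, the following sketch parallelizes a dot product with a single OpenMP directive. The function name `dot` and the data are hypothetical and not taken from the product documentation. It assumes OpenMP is enabled at compile time (for example `-qopenmp` with the Intel compilers or `-fopenmp` with GCC and Clang); without such a flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: dot product of two vectors.
// The directive splits the loop iterations among threads; the reduction
// clause gives each thread a private partial sum and combines the partial
// sums into `sum` when the loop finishes.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    // Use the shorter length so mismatched inputs cannot read out of bounds.
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the only change to the serial code is the directive, the same source builds and runs correctly with or without OpenMP, which is one reason the directive-based approach suits incremental modernization.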

The number of cores per socket has steadily increased since the first multicore processor was released, but while higher-level parallelism is important, lower-level code tuning should not be ignored. In fact, parallelizing code that has not been properly tuned can be counterproductive. There are few things more disheartening than going through the effort of parallelizing an application only to find that vectorizing a few key loops gives better performance and renders the previous parallelization unnecessary. Vectors continue to get wider in modern processor architectures, so the Intel compilers contain many new enhancements to enable efficient vectorization [6]. In addition to the OpenMP vectorization directives mentioned above, the Intel compilers exploit the latest Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions in Intel® Xeon® Scalable and Intel® Xeon Phi™ processor architectures [7].
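To make "vectorizing a few key loops" concrete, here is a minimal sketch that uses the OpenMP `simd` directive on a single-precision a*x + y loop. The function name `saxpy` is a hypothetical choice for illustration. The directive asks the compiler to vectorize the loop; whether it emits Intel AVX-512 or narrower instructions depends on the compiler and target options, which this sketch does not assume. The directive is typically honored when OpenMP SIMD support is enabled (for example `-qopenmp-simd` with the Intel compilers or `-fopenmp-simd` with GCC) and ignored otherwise.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: y[i] = alpha * x[i] + y[i] for every element.
// Each iteration is independent of the others, so the loop is a good
// candidate for SIMD execution.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    const float* xp = x.data();  // raw pointers keep the loop body simple
    float* yp = y.data();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = alpha * xp[i] + yp[i];
    }
}
```

A loop like this is often worth vectorizing before adding threads, since the two levels of parallelism multiply rather than compete.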

The compilers in Intel Parallel Studio XE 2018 support the latest Fortran, C, and C++ standards. More recently, the Intel® Distribution for Python* was added to the suite. Our optimized Python distribution integrates the Intel® Performance Libraries into many Python packages (e.g., NumPy, SciPy, scikit-learn, mpi4py). (Other productivity languages like Julia* [8] and R* [9, 10], which are not part of the product, can also take advantage of the Intel performance libraries.) Intel Parallel Studio XE 2018 also includes the following highly-optimized libraries: Intel® Math Kernel Library (Intel® MKL), Intel® Integrated Performance Primitives (Intel® IPP), the Intel® Data Analytics Acceleration Library (Intel® DAAL), the Intel® MPI Library, and the Intel® Threading Building Blocks (Intel TBB). Intel® MKL provides tuned, parallel math functions for dense and sparse linear algebra, Fourier transforms, neural networks, random number generation, basic statistics, etc. The latest version contains new APIs to improve the performance of the bulk matrix multiplication and convolution required during neural network training. Common computations in image processing, computer vision, signal processing, compression/decompression, cryptography, and string processing are available in Intel® IPP [11]. The newest library in the suite, Intel® DAAL, supports basic statistics and machine learning (e.g., dimensionality reduction, anomaly detection, classification, regression, clustering) [9, 12, 13, 14].

For C++ programmers, Intel continues to support Intel® TBB, the widely used template library for task parallelism [15]. (Note that in spite of the name, Intel TBB is open-sourced under an Apache 2.0 license. Intel has always preferred open, vendor-neutral standards over proprietary programming models.) Intel TBB fully leverages multicore processors, but its most exciting new feature is the flow graph coordination layer. Flow graph allows the programmer to describe complex workflows, from which the Intel TBB runtime extracts parallelism. Intel TBB flow graph could become the preferred parallel programming model for heterogeneous processor environments. Intel Parallel Studio XE 2018 contains a preview feature under Intel Advisor called Flow Graph Analyzer to help create and optimize flow graphs [16].

In addition to compilers and performance libraries, Intel Parallel Studio XE 2018 contains powerful code analysis tools to assist with debugging and tuning at the instruction, thread, and process levels of parallelism. Intel® Inspector is a one-of-a-kind debugger that not only finds garden-variety bugs like memory leaks but also performs correctness checking on threaded code to identify data races, potential deadlocks, and other non-deterministic concurrency errors. Intel® VTune™ Amplifier provides basic profiling to find performance hotspots, but it does much more, including microarchitecture analysis and memory and I/O analysis. Its latest release adds support for profiling applications running in containers, and the new Application Performance Snapshot feature provides a one-page overview of an application’s efficiency and performance characteristics across MPI, CPU, FPU, and memory use. Intel® Advisor, another one-of-a-kind tool, allows users to quickly prototype regions for potential parallelism and project likely speedup. However, its most exciting new feature is cache-aware roofline analysis, which pinpoints underperforming loops, graphically shows which are good candidates for code tuning, and gives advice about the likely performance bottlenecks [6, 17].

The Intel® Trace Analyzer and Collector performs correctness checking and communication profiling of MPI applications. Its latest version now supports OpenSHMEM, an open standard API for parallelism in a partitioned global address space (PGAS). PGAS could become an important programming model for future parallel systems. Finally, Intel® Cluster Checker, a tool for analyzing cluster health, added new features to improve usability and diagnostic output, check Intel® Omni-Path Architecture (Intel® OPA), and much more [18].
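The roofline analysis mentioned above for Intel Advisor rests on a simple bound: a loop cannot run faster than the peak compute rate, nor faster than its arithmetic intensity (floating-point operations per byte moved) multiplied by the available memory bandwidth. The sketch below shows that arithmetic only. It is not how Intel Advisor is implemented, the function names are hypothetical, and it uses a single bandwidth value, whereas a cache-aware roofline draws one roof per level of the memory hierarchy.

```cpp
#include <algorithm>

// Basic roofline bound, in GFLOP/s:
//   min(peak compute rate, arithmetic intensity * memory bandwidth)
// peak_gflops:    peak floating-point rate of the machine (GFLOP/s)
// bandwidth_gbs:  sustained memory bandwidth (GB/s)
// flops_per_byte: arithmetic intensity of the loop (FLOP/byte)
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double flops_per_byte) {
    return std::min(peak_gflops, bandwidth_gbs * flops_per_byte);
}

// A loop is bandwidth-bound when its bandwidth roof lies below the
// compute roof, i.e. it sits to the left of the roofline's ridge point.
bool is_memory_bound(double peak_gflops, double bandwidth_gbs,
                     double flops_per_byte) {
    return bandwidth_gbs * flops_per_byte < peak_gflops;
}
```

For a hypothetical machine with a 1000 GFLOP/s peak and 100 GB/s of bandwidth, a loop performing 0.25 FLOP per byte is capped at 25 GFLOP/s, so reducing its memory traffic is likely to pay off more than tuning its arithmetic.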

Few Intel Parallel Studio XE users realize how much this tool suite has evolved, how mature some of its components really are (more than 20 years old), and how it has driven new approaches and helped developers accelerate parallel programming performance significantly over the last decade. However, its design goal has remained the same: to enable future-proof code modernization. For example, the same cache optimization techniques (e.g., blocking and tiling) that were beneficial 20 years ago are still beneficial. Today, however, code modernization is about exploiting parallelism, starting with vectorization (instruction-level parallelism), then threading, and finally message-passing on distributed-memory clusters. What does the future hold? Perhaps heterogeneous parallelism, PGAS languages, or persistent memory. Whatever the future holds, Intel Parallel Studio XE will evolve accordingly.
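The blocking and tiling technique mentioned above can be sketched with a matrix multiplication. The loops are split so that work proceeds on small square tiles, which lets data be reused while it is still in cache. The function name `matmul_blocked` and the default tile size of 64 are assumptions for illustration; a suitable tile size depends on the cache sizes of the target processor and is best found by measurement.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical example: C += A * B for row-major n x n matrices.
// The three outer loops walk over tiles; the three inner loops do the
// multiplication within one tile, so each tile of A, B, and C is reused
// many times before the computation moves on.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t n,
                    std::size_t tile = 64) {
    if (tile == 0) tile = 1;  // guard against a zero step in the tile loops
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        // Unit-stride access to B and C in the innermost loop
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The caller is expected to size `A`, `B`, and `C` to `n * n` elements and to zero `C` first if a plain product is wanted. The result does not depend on the tile size, so the tile can be tuned freely for a given machine.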
