Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel® Xeon Phi™ coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment with aligned-data hints, and pointer disambiguation. In addition, loop tiling is shown as a technique for tuning memory traffic. The optimization methods are illustrated with the example of a single-threaded LU decomposition of a single-precision matrix of size 128×128.
Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.
The code discussed in the paper can be freely downloaded from https://github.com/ColfaxResearch/LU-decomposition
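To make three of the techniques named above concrete, here is a minimal sketch in C. It is not the code from the paper or the repository. It assumes an in-place Doolittle factorization without pivoting, a row-major layout, and a 64-byte alignment. The names `lu_decompose`, `alloc_matrix`, `fill_test_matrix`, `lu_max_error`, and `ASSUME_ALIGNED` are hypothetical and exist only for this illustration. The sketch shows strength reduction (one reciprocal per pivot instead of one division per row), pointer disambiguation with `restrict`, and aligned allocation with an alignment hint. Regularizing the vectorization pattern and loop tiling are not shown.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

#define N 128     /* matrix size quoted in the abstract */
#define ALIGN 64  /* assumed alignment in bytes (one 512-bit vector) */

/* Alignment hint: a GCC/Clang builtin; other compilers fall back to a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define ASSUME_ALIGNED(p, n) __builtin_assume_aligned((p), (n))
#else
#define ASSUME_ALIGNED(p, n) (p)
#endif

/* Allocate an n x n float matrix on an ALIGN-byte boundary (C11 aligned_alloc).
   The byte count must be a multiple of ALIGN, which holds for n = 128. */
static float *alloc_matrix(int n)
{
    return aligned_alloc(ALIGN, (size_t)n * (size_t)n * sizeof(float));
}

/* In-place LU decomposition without pivoting. On return, the strict lower
   triangle holds L (unit diagonal implied) and the upper triangle holds U.
   The caller must pass a pointer obtained from alloc_matrix(). */
static void lu_decompose(float *a_in, int n)
{
    assert(a_in != NULL && n > 0);
    float *a = ASSUME_ALIGNED(a_in, ALIGN);

    for (int k = 0; k < n; k++) {
        /* Strength reduction: one division per pivot, then multiplications. */
        const float recip = 1.0f / a[k * n + k];
        /* Pointer disambiguation: rows k and i never overlap because i > k. */
        const float *restrict row_k = &a[k * n];
        for (int i = k + 1; i < n; i++) {
            float *restrict row_i = &a[i * n];
            const float l_ik = row_i[k] * recip;
            row_i[k] = l_ik;
            /* Unit-stride inner loop, a candidate for automatic vectorization. */
            for (int j = k + 1; j < n; j++)
                row_i[j] -= l_ik * row_k[j];
        }
    }
}

/* Fill a diagonally dominant test matrix so that pivoting is not required. */
static void fill_test_matrix(float *a, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = (i == j) ? (float)n : 1.0f / (float)(1 + i + j);
}

/* Largest absolute difference between L*U and the original matrix. */
static float lu_max_error(const float *lu, const float *orig, int n)
{
    float max_err = 0.0f;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            const int k_max = (i < j) ? i : j;
            float sum = 0.0f;
            for (int k = 0; k <= k_max; k++) {
                const float l_ik = (k == i) ? 1.0f : lu[i * n + k];
                sum += l_ik * lu[k * n + j];
            }
            const float err = fabsf(sum - orig[i * n + j]);
            if (err > max_err)
                max_err = err;
        }
    }
    return max_err;
}
```

A caller would allocate two matrices with `alloc_matrix(N)`, fill both with `fill_test_matrix`, factor one with `lu_decompose`, and compare the product of the factors against the untouched copy with `lu_max_error`. Whether the compiler actually vectorizes the inner loop depends on the compiler and its flags, so checking the vectorization report is advisable.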