Profiling TensorFlow workloads with Intel VTune Amplifier

Published Date: 13 - Sep - 2017 | Last Updated: 14 - Sep - 2017
 

Machine learning applications are compute intensive by nature, so performance optimization matters a great deal for them. One of the most popular libraries, TensorFlow*, already has an embedded timeline feature that helps identify which parts of the computational graph cause bottlenecks, but it lacks advanced features such as architectural analysis. This short tutorial shows how to combine the data provided by Tensorflow.timeline with the options available in one of the most powerful performance profilers for Intel Architecture: Intel® VTune™ Amplifier.
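
For readers who have not used the timeline feature before, the sketch below shows one way a trace file might be written from a training or inference step. It assumes the TensorFlow 1.x session API (tf.RunOptions, tf.RunMetadata, and the tensorflow.python.client.timeline module); the helper names trace_path and run_with_timeline and the file naming scheme are hypothetical and are not part of the original article.

```python
import os


def trace_path(json_dir, step):
    """Hypothetical naming scheme: one .json trace file per profiled step."""
    return os.path.join(json_dir, "timeline_step_%d.json" % step)


def run_with_timeline(sess, fetches, json_dir, step, feed_dict=None):
    """Run one session step with full tracing and save the timeline as JSON.

    Assumes the TensorFlow 1.x API; TensorFlow is imported here so that the
    module can be loaded on machines where it is not installed.
    """
    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    result = sess.run(fetches, feed_dict=feed_dict,
                      options=run_options, run_metadata=run_metadata)

    # Convert the collected step statistics to Trace Event Format.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open(trace_path(json_dir, step), "w") as f:
        f.write(trace.generate_chrome_trace_format())
    return result
```

Tracing every step adds overhead, so in practice it may be enough to call such a helper for a few representative steps only.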

Tensorflow.timeline generates data in the Trace Event Format, which VTune Amplifier cannot consume directly but which can be converted to a .csv format that it supports. We do this conversion at the end of the collection with the help of a custom collector script, listed below:

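#!/bin/sh
# Custom collector for VTune Amplifier: converts TensorFlow timeline traces
# (.json) to .csv when the collection stops.

if [ "$#" -ne 1 ]; then
    echo "Usage: collect.sh json_dir"
    exit 1
fi

JSON_DIR=$1

# VTune Amplifier passes the collection phase in AMPLXE_COLLECT_CMD.
case "$AMPLXE_COLLECT_CMD" in
"start")
    # Remove stale traces left over from a previous run.
    rm -f "$JSON_DIR"/*.json
    ;;

"stop")
    for f in "$JSON_DIR"/*.json
    do
        # Skip the unexpanded pattern when no trace was written.
        [ -e "$f" ] || continue
        python "$(dirname "$0")/convert.py" "$f" "$AMPLXE_HOSTNAME" "$AMPLXE_DATA_DIR"
    done
    ;;

"pause")
    ;;
"resume")
    ;;

*)
    echo "unexpected value of AMPLXE_COLLECT_CMD" >&2
    ;;
esac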

This script uses the helper convert.py Python* script shown here:

#!/usr/bin/env python
# Convert a TensorFlow timeline trace (Trace Event Format) to a .csv file
# that VTune Amplifier can load.

import datetime
import json
import os
import sys

EPOCH = datetime.datetime(1970, 1, 1)

def convertTime(t):
    # 'ts' and 'dur' are in microseconds; return the matching UTC datetime.
    return EPOCH + datetime.timedelta(microseconds=t)

if len(sys.argv) < 4:
    print("Usage: convert.py input_file.json host output_dir")
    sys.exit(1)

fnInp = sys.argv[1]
host = sys.argv[2]
outPath = sys.argv[3]
# Name the output <input_name>-hostname-<host>.csv in the result directory.
fnOut = os.path.splitext(os.path.basename(fnInp))[0]
fnOut = os.path.join(outPath, fnOut + '-hostname-' + host + '.csv')

with open(fnInp, 'r') as fInp:
    trace = json.load(fInp)

with open(fnOut, 'w') as fOut:
    fOut.write('name,start_tsc.UTC,end_tsc,pid,tid\n')
    for event in trace['traceEvents']:
        # Only complete events ('X') carry both a start time and a duration.
        if event.get('ph') == 'X':
            t = int(event['ts'])
            tbUtc = convertTime(t)
            teUtc = convertTime(t + int(event['dur']))
            s = event['name'] + ','
            s += str(tbUtc) + ','
            s += str(teUtc) + ','
            # The pid and tid columns are left empty.
            s += ',\n'
            fOut.write(s)
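
To make the conversion concrete, here is a minimal, self-contained sketch that applies the same logic to a tiny in-memory trace. The sample events, their timestamps, and the helper names to_utc and to_csv_rows are made up for illustration; a real trace produced by Tensorflow.timeline contains many more events and fields.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def to_utc(ts_us):
    """Convert microseconds since the Unix epoch to a naive UTC datetime."""
    return EPOCH + datetime.timedelta(microseconds=ts_us)


# Hypothetical trace: one metadata event ('M') and two complete events ('X').
sample_trace = {
    "traceEvents": [
        {"ph": "M", "name": "process_name", "pid": 0,
         "args": {"name": "hypothetical device"}},
        {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0,
         "ts": 1500000, "dur": 250000},
        {"ph": "X", "name": "Relu", "pid": 0, "tid": 1,
         "ts": 1750000, "dur": 1000000},
    ]
}


def to_csv_rows(trace):
    """Return one CSV row per complete event, with empty pid and tid columns."""
    rows = []
    for event in trace["traceEvents"]:
        if event.get("ph") != "X":
            continue  # metadata and other event types have no duration
        start = to_utc(int(event["ts"]))
        end = to_utc(int(event["ts"]) + int(event["dur"]))
        rows.append("%s,%s,%s,," % (event["name"], start, end))
    return rows


for row in to_csv_rows(sample_trace):
    print(row)
# MatMul,1970-01-01 00:00:01.500000,1970-01-01 00:00:01.750000,,
# Relu,1970-01-01 00:00:01.750000,1970-01-01 00:00:02.750000,,
```

Note that an operation name containing a comma would add an extra column to its row, so such names may need to be sanitized before writing.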

When configuring a VTune Amplifier project, go to the Analysis Target window and, in the Custom collector field, specify the path to the collect.sh script followed by the path to the directory with the .json files generated by Tensorflow.timeline, as follows:

$ <path_to_collect.sh>/collect.sh <path_to_dir_with_json_files>

The script accepts one parameter: the path to the directory with the .json files generated by Tensorflow.timeline. At the end of the collection, the script automatically picks up the .json files from that directory, converts them to the .csv format, and puts the converted files into the result directory next to the other traces collected by VTune Amplifier. When the collection is done, VTune Amplifier automatically loads all the data and shows everything on the same timeline, both correlated and aggregated.
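
If the converted data does not show up as expected, it can help to inspect the generated .csv files before looking elsewhere. The following is a small sanity check, sketched under the assumption that the files have the header and column layout written by convert.py; the function names are hypothetical, and the check says nothing about what VTune Amplifier itself accepts.

```python
import csv
import datetime

EXPECTED_HEADER = ["name", "start_tsc.UTC", "end_tsc", "pid", "tid"]


def parse_utc(text):
    """Parse a timestamp as printed by str(datetime), with or without microseconds."""
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            pass
    raise ValueError("unrecognized timestamp: %r" % text)


def check_csv(path):
    """Return a list of human-readable problems found in a converted file."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            problems.append("unexpected header: %r" % (header,))
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(EXPECTED_HEADER):
                # Typically caused by a comma inside an operation name.
                problems.append("line %d: expected %d fields, got %d"
                                % (line_no, len(EXPECTED_HEADER), len(row)))
                continue
            try:
                if parse_utc(row[1]) > parse_utc(row[2]):
                    problems.append("line %d: start is after end" % line_no)
            except ValueError as err:
                problems.append("line %d: %s" % (line_no, err))
    return problems
```

An empty list means that every row has five fields and a start time that does not follow its end time.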

The example above uses the Source Function / Function / Call Stack grouping instead of the default Function / Call Stack grouping because TensorFlow was built with Intel® Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) support, which generates code just in time (JIT). As a result, Intel MKL-DNN in some cases generates multiple instances of the same function. With the default Function / Call Stack grouping, VTune Amplifier would show these instances as different functions, which could lead to an incorrect interpretation of the result: no single instance is hot by itself, but all of them accumulated form the hotspot.
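
The effect of the grouping can be illustrated with a toy calculation. The numbers and function names below are invented, and the code only mimics the idea of summing CPU time per grouping key; it is not how VTune Amplifier computes its views.

```python
from collections import defaultdict

# Hypothetical hotspot rows: (function instance, source function, CPU seconds).
# Three JIT-generated instances share one source function.
rows = [
    ("jit_kernel@0x7f10", "jit_conv_kernel", 0.5),
    ("jit_kernel@0x7f20", "jit_conv_kernel", 0.75),
    ("jit_kernel@0x7f30", "jit_conv_kernel", 1.0),
    ("gemm_like_function", "gemm_like_function", 1.25),
]


def total_by(rows, key_index):
    """Sum CPU time per grouping key (0 = instance, 1 = source function)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


def hottest(totals):
    """Return the key with the largest accumulated CPU time."""
    return max(totals, key=totals.get)


print(hottest(total_by(rows, 0)))  # gemm_like_function
print(hottest(total_by(rows, 1)))  # jit_conv_kernel
```

Grouped by instance, the largest entry is gemm_like_function with 1.25 seconds; grouped by source function, the three JIT instances add up to 2.25 seconds and become the top hotspot.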

The described technique makes it possible to apply the full power of the analyses available in VTune Amplifier to TensorFlow-based applications. For instance, finding the operations responsible for the hotspots is just a matter of applying a proper Source Function / Function Domain grouping. This grouping can be configured manually as a custom grouping.
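
Conceptually, relating hotspots to operations means attributing each profiler sample to the operation interval that was active at the time of the sample. The sketch below shows that idea on invented data; the intervals, samples, and the function name group_by_operation are hypothetical, and the real correlation is done inside VTune Amplifier.

```python
from collections import defaultdict

# Hypothetical operation intervals from the converted trace: (name, start, end).
operations = [("Conv2D", 0.0, 2.0), ("MatMul", 2.0, 3.0)]

# Hypothetical profiler samples: (function, timestamp, CPU seconds).
samples = [
    ("jit_conv_kernel", 0.5, 0.25),
    ("jit_conv_kernel", 1.5, 0.5),
    ("gemm_kernel", 2.5, 0.75),
    ("memcpy", 3.5, 0.125),
]


def group_by_operation(operations, samples):
    """Sum CPU time per (operation, function) pair based on sample timestamps."""
    totals = defaultdict(float)
    for function, timestamp, cpu_time in samples:
        for name, start, end in operations:
            if start <= timestamp < end:
                totals[(name, function)] += cpu_time
                break
        else:
            # The sample was taken while no traced operation was running.
            totals[("[no operation]", function)] += cpu_time
    return dict(totals)


for key, cpu_time in sorted(group_by_operation(operations, samples).items()):
    print(key, cpu_time)
```

With this data, jit_conv_kernel accumulates 0.75 seconds under Conv2D, gemm_kernel 0.75 seconds under MatMul, and the memcpy sample falls outside any traced operation.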

Source: https://software.intel.com/en-us/articles/profiling-tensorflow-workloads-with-intel-vtune-amplifier