Overview
In this post I will describe the steps required to get Tensorflow running with OpenCL on an AMD GPU.
Install AMDGPU-PRO Drivers
Install with (applying sudo credentials where necessary):
wget -referer=https://support.amd.com https://www2.ati.com/drivers/linux/ubuntu/amdgpu-pro-17.50-511655.tar.xz
tar -xvf amdgpu-pro-17.50-511655.tar.xz
cd amdgpu-pro-17.50-511655
./amdgpupro-install --opencl=legacy --headless
sudo reboot
Install a recent python
I use pyenv to manage different versions of python, setup your python as follows:
curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash
Add the following lines to your ~/.bashrc
file:
export PATH="~/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
Restart your terminal and you have pyenv installed. Now install a python (I’ll use the latest 3.6) and its system dependencies:
sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev
env PYTHON_CONFIGURE_OPTS="--enable-shared" MAKEOPTS="-j 8" pyenv install 3.6.5
pyenv local 3.6.5
pip install wheel numpy
Get The ComputeCPP SYCL Implementation
Tensorflow can use the SYCL interface to seamlessly run device agnostic c++ code on an OpenCL enabled device. There is an open source template based library called triSYCL. It requires c++17 support and thus you probably need to build your own gcc from a recent release. I won’t cover it here but you can if you want to.
Sign up here and get your computecpp library:
https://developer.codeplay.com/computecppce/latest/overview
tar -xvf ComputeCpp-CE-0.6.1-Ubuntu.16.04-64bit.tar.gz
sudo mv ComputeCpp-CE-0.6.1-Ubuntu-16.04-64bit /usr/local/computecpp
Get Tensorflow with experimental opencl support
This fork of tensorflow is maintained by someone from Codeplay, who make ComputeCPP. It’s highly experimental, expect it to change in the future. We use the dev/amd_gpu branch which is currently under active development:
git clone https://github.com/lukeiwanski/tensorflow
cd tensorflow
git fetch
git checkout dev/amd_gpu
pyenv local 3.6.5
In order to build tensorflow, you need to use Google’s bazel build system:
wget https://github.com/bazelbuild/bazel/releases/download/0.11.1/bazel_0.11.1-linux-x86_64.deb
sudo dpkg -i bazel_0.11.1-linux-x86_64.deb
Environment and Building
You can now configure and build tensorflow, from the tensorflow directory run:
./configure
I used the following options (press enter to get the defaults):
$ ./configure
You have bazel 0.11.1 installed.
Please specify the location of python. [Default is /home/user/.pyenv/versions/3.6.5/bin/python]:
Please input the desired Python library path to use. Default is [/home/user/.pyenv/versions/3.6.5/lib/python3.6/site-packages]
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
jemalloc as malloc support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]: n
No Apache Kafka Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.
Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: y
OpenCL SYCL support will be enabled for TensorFlow.
Please specify which C++ compiler should be used as the host C++ compiler. [Default is /usr/bin/g++]:
Please specify which C compiler should be used as the hostC compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with ComputeCPP support? [Y/n]: Y
ComputeCPP support will be enabled for TensorFlow.
Please specify the location where ComputeCpp for SYCL 1.2 is installed. [Default is /usr/local/computecpp]:
Do you wish to build TensorFlow with double types in SYCL support? [Y/n]: Y
double types in SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with half types in SYCL support? [y/N]: n
No half types in SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: n
No CUDA support will be enabled for TensorFlow.
Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Now you can build the package, consider going to get lunch because this will take a while:
bazel build --config=opt --config=sycl //tensorflow/tools/pip_package:build_pip_package
Hopefully it built without errors, if it did, try to fix them. Now install the generated wheel file:
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tf_cpp_whl
pip install /tmp/tf_cpp_whl/tensorflow-1.6.0rc0-cp27-cp27mu-linux_x86_64.whl
Now you’ve got tensorflow installed, but it’s not over yet, you need everything setup nicely to point to the appropriate OpenCL ICD and use everything in your updated amdgpu-pro drivers. Consider putting these environment variables in your ~/.bashrc
file:
export $OPENCL_VENDOR_PATH=/opt/amdgpu-pro/etc/OpenCL/vendors/
mkdir -p $OPENCL_VENDOR_PATH
echo "/opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl64.so" > $OPENCL_VENDOR_PATH/amdocl64.icd
export LD_LIBRARY_PATH=/usr/local/computecpp/lib:/opt/amdgpu-pro/lib/x86_64-linux-gnu
export PATH=/usr/local/computecpp/bin:/opt/amdgpu-pro/bin:$PATH
Here’s some explanation:
OPENCL_VENDOR_PATH
is where libOpenCL.so will look for.icd
files which point to a specific library for a driver. In our case it’s thelibamdocl64.so
, each one will appear as a different vendor, so you could install pocl or oclgrindLD_LIBRARY_PATH
is where the system looks for shared object filesPATH
is where the system looks like for executable files. The two listed above contain clinfo and computecpp_info
Run clinfo
and you should get something like this describing your GPU(s):
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (2527.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon HD 8500 Series
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 6
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 780Mhz
Address bits: 64
Max memory allocation: 330696294
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 505212928
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f9728fbf510
Name: Oland
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2527.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2527.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
Running computecpp_info
should get you something like this:
********************************************************************************
ComputeCpp Info (CE 0.6.1)
********************************************************************************
Toolchain information:
GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 1 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : UNTESTED - Vendor not tested on this OS
CL_DEVICE_NAME : Oland
CL_DEVICE_VENDOR : Advanced Micro Devices, Inc.
CL_DRIVER_VERSION : 2527.3
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.6.1/platform-support-notes
********************************************************************************
Verification
You can now quickly check that everything installed correctly (don’t run this from the tensorflow source directory, you’ll run into issues):
cd ~
pyenv local 3.6.5
python -c "from tensorflow.python.client import device_lib;device_lib.list_local_devices()"
2018-04-04 16:39:13.673362: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-04-04 16:39:13.673394: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Oland, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
Now you should be able to run some code that uses the AMD GPU! Here’s a small example using keras (which you need to install):
pip install keras
Here’s a toy regression example (it learns the sin function) that you can use to compare the speed of CPU vs. GPU:
#!/usr/bin/python
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation
import numpy as np
x = np.arange(10000)
y = np.sin(x)
for i in ('/cpu:0', '/gpu:0'):
with tf.device(i):
model = Sequential([
Dense(128, input_shape=(1,), activation='relu'),
Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x,y, validation_split=0.25, epochs=10)
Running this I get the following output:
Using TensorFlow backend.
Train on 7500 samples, validate on 2500 samples
Epoch 1/10
2018-04-04 16:53:55.766232: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-04-04 16:53:55.766265: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Oland, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
7500/7500 [==============================] - 0s 53us/step - loss: 29006.8066 - val_loss: 0.5027
Epoch 2/10
7500/7500 [==============================] - 0s 25us/step - loss: 0.5071 - val_loss: 0.5019
Epoch 3/10
7500/7500 [==============================] - 0s 26us/step - loss: 0.5091 - val_loss: 0.5005
Epoch 4/10
7500/7500 [==============================] - 0s 25us/step - loss: 0.5105 - val_loss: 0.5007
Epoch 5/10
7500/7500 [==============================] - 0s 30us/step - loss: 0.5166 - val_loss: 0.5006
Epoch 6/10
7500/7500 [==============================] - 0s 27us/step - loss: 0.5141 - val_loss: 0.6930
Epoch 7/10
7500/7500 [==============================] - 0s 26us/step - loss: 0.5173 - val_loss: 0.5968
Epoch 8/10
7500/7500 [==============================] - 0s 27us/step - loss: 0.5162 - val_loss: 0.5382
Epoch 9/10
7500/7500 [==============================] - 0s 26us/step - loss: 0.5195 - val_loss: 0.5377
Epoch 10/10
7500/7500 [==============================] - 0s 25us/step - loss: 0.5160 - val_loss: 0.5020
Train on 7500 samples, validate on 2500 samples
Epoch 1/10
7500/7500 [==============================] - 92s 12ms/step - loss: nan - val_loss: nan
Epoch 2/10
7500/7500 [==============================] - 2s 265us/step - loss: nan - val_loss: nan
Epoch 3/10
7500/7500 [==============================] - 2s 262us/step - loss: nan - val_loss: nan
Epoch 4/10
7500/7500 [==============================] - 2s 268us/step - loss: nan - val_loss: nan
Epoch 5/10
7500/7500 [==============================] - 2s 262us/step - loss: nan - val_loss: nan
Epoch 6/10
7500/7500 [==============================] - 2s 268us/step - loss: nan - val_loss: nan
Epoch 7/10
7500/7500 [==============================] - 2s 268us/step - loss: nan - val_loss: nan
Epoch 8/10
7500/7500 [==============================] - 2s 259us/step - loss: nan - val_loss: nan
Epoch 9/10
7500/7500 [==============================] - 2s 258us/step - loss: nan - val_loss: nan
Epoch 10/10
7500/7500 [==============================] - 2s 264us/step - loss: nan - val_loss: nan
As you can see, in the GPU run, it takes a long time to run the first epoch of the GPU the model, and significantly longer to run each training step (also that the GPU learnt nothing). You can play with the code to see if you can get it to work but waiting ~90s for the model to start the first epoch is a pain for testing.
It might just be I have an old workstation card, or that there is a lot of overhead involved in using SYCL, I hope this helps someone else!