GPU Docker Training

Alex Egg,

It’s simple enough to dockerize an ML model. However, it is nontrivial if you want to train your model on a GPU, as traditional docker patterns do not expose the host’s GPU hardware to the container.

This example will use TensorFlow and AWS EC2.

ML Model w/o Docker

Consider an arbitrary TensorFlow model.
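
The model itself is not important here. As a point of reference, a minimal, hypothetical train.py sketch that accepts the flags used below might look like this (the argument names match the commands in this post, but the placeholder training loop is purely illustrative, not the real training code):

# train.py -- minimal sketch; the flags mirror the commands in this post,
# while the "training" below is only a stand-in for your real model code.
import argparse

import tensorflow as tf


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--image_dir", required=True,
                        help="Local or s3:// directory of training images")
    parser.add_argument("--model_path", default="inception_v4.ckpt",
                        help="Checkpoint to fine-tune from")
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--num_workers", type=int, default=4,
                        help="Number of input-pipeline threads")
    return parser.parse_args()


def main():
    args = parse_args()
    # The input pipeline, model, loss, and real train op would be built here.
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)  # stand-in for a real train op

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):  # stand-in training loop
            sess.run(train_op)
        print("trained %d steps on %s" % (sess.run(global_step), args.image_dir))


if __name__ == "__main__":
    main()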

This model can be trained by running:

python train.py \
--image_dir=s3://root-bucket/jobs/4 \
--model_path=inception_v4.ckpt \
--train_batch_size=62 \
--num_workers=100

Now, if we want to run this on a beefier machine, we can dockerize it.

Dockerize ML Model

We can build off of the base TensorFlow CPU image:

FROM tensorflow/tensorflow:latest
MAINTAINER Alex Egg <[email protected]>

ENV WKDIR=/app
RUN mkdir $WKDIR
COPY . $WKDIR/

WORKDIR $WKDIR
# RUN train.py 
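
Before the training machine can pull this image, you need to build it and push it to a registry. A rough sketch for ECR, assuming the model-trainer repository already exists in the account and region used below:

docker build -t model-trainer .
# aws ecr get-login prints a docker login command; run its output to authenticate
$(aws ecr get-login --no-include-email --region eu-west-1)
docker tag model-trainer:latest 546216854.dkr.ecr.eu-west-1.amazonaws.com/model-trainer:latest
docker push 546216854.dkr.ecr.eu-west-1.amazonaws.com/model-trainer:latest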

Now you can train your model by starting the container:

sudo docker run -ti \
	-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
	-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
	-e S3_REGION=$S3_REGION \
	546216854.dkr.ecr.eu-west-1.amazonaws.com/model-trainer \
	python train.py \
	--image_dir=s3://root-bucket/jobs/4 \
	--model_path=inception_v4.ckpt \
	--train_batch_size=62 \
	--num_workers=100

We can achieve higher training throughput if we utilise GPUs, so let’s explore the GPU option.

Docker GPU

TF supports training on Nvidia GPUs. The Docker project doesn’t support Nvidia GPU hardware natively. However, Nvidia provides a thin wrapper around Docker, nvidia-docker, that lets a container access the host’s GPU hardware.

First let’s slightly modify our Dockerfile to use a TF build optimised for GPUs:

FROM tensorflow/tensorflow:latest-gpu
MAINTAINER Alex Egg <[email protected]>
...

Now we need a machine with an Nvidia GPU, the Nvidia CUDA drivers, Docker, and nvidia-docker.

Luckily, the AWS Deep Learning Ubuntu AMI already ships with CUDA 9 drivers for the Tesla V100 GPUs on the P3 instances, so we can leverage that and only install Docker and nvidia-docker.
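
For reference, launching such a machine with the AWS CLI might look roughly like this (the AMI ID, key pair, and security group below are placeholders, not real values):

# p3.2xlarge = 1x Tesla V100; substitute the Deep Learning Ubuntu AMI ID for your region
aws ec2 run-instances \
	--image-id ami-xxxxxxxx \
	--instance-type p3.2xlarge \
	--key-name my-key-pair \
	--security-group-ids sg-xxxxxxxx \
	--region eu-west-1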

On the Deep Learning AMI instance, let’s install the Docker stack:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install -y docker-ce

wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb

With Docker and nvidia-docker installed, you launch containers with nvidia-docker run instead of docker run. You can test the setup by checking if a container can query the device:

sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

You should see something like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   71C    P0   242W / 300W |  15526MiB / 16152MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23194      C   python                                     15516MiB |
+-----------------------------------------------------------------------------+

Next, check that TensorFlow can access the device by starting a GPU-enabled TensorFlow container:

sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash

Then start python inside the container and put this in the REPL:

import tensorflow as tf

a = tf.constant(5, tf.float32)
b = tf.constant(5, tf.float32)
# log_device_placement makes TF print which device (e.g. /device:GPU:0) each op runs on
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(tf.add(a, b)))  # prints 10.0
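
If you just want a quick yes/no answer, tf.test.gpu_device_name() returns the name of the first GPU TensorFlow can see, or an empty string if it sees none:

import tensorflow as tf
print(tf.test.gpu_device_name())  # e.g. /device:GPU:0, or '' if no GPU is visible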

Then, once you have confirmed your environment is set up, you can start the training image:

sudo nvidia-docker run -ti \
	-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
	-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
	-e S3_REGION=$S3_REGION \
	546216854.dkr.ecr.eu-west-1.amazonaws.com/model-trainer \
	python train.py \
	--image_dir=s3://root-bucket/jobs/4 \
	--model_path=inception_v4.ckpt \
	--train_batch_size=62 \
	--num_workers=100

TensorBoard

As of TF 1.4 there is a filesystem plugin for S3, so you can set up your summary logging in a bucket:

train_writer = tf.summary.FileWriter("s3://my-model/summaries/train", graph)
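
Assuming the same AWS credentials are exported in its environment, TensorBoard should then be able to read the summaries straight from the bucket:

AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
S3_REGION=$S3_REGION \
tensorboard --logdir s3://my-model/summaries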
