GPU Benchmark: Training a semi-supervised VAE on the SVHN dataset

Our goal is a head-to-head comparison of speed and memory consumption across different single-GPU hardware configurations when training a large probabilistic model (a semi-supervised VAE with a Wide ResNet 28 decoder) on the benchmark SVHN dataset.

Configurations tested

We tested a range of GPU hardware available on the Tufts HPC cluster, on other clusters Mike has access to, and on a recent standalone Exxact workstation.

Specs for each configuration are:

GPU          TFLOPS (fp32)  GPU memory  CPU                                Location
T4           8              16 GB       Intel Xeon Gold 6248 @ 2.50GHz     Tufts HPC cluster
P100         9              16 GB       Intel Xeon E5-2695 v4 @ 2.10GHz    Tufts HPC cluster
RTX2080Ti    13             11 GB       Intel i9-9940X @ 3.30GHz           Exxact node
RTX2080Tiv2  13             11 GB       Intel Xeon Bronze 3106 @ 1.70GHz   External cluster
V100         15             32 GB       Intel Xeon E5-2695 v4 @ 2.10GHz    Tufts HPC cluster

The configurations are ordered roughly from weakest to fastest GPU (in terms of peak float32 TFLOPS).

All computations will be done using float32 (single-precision).
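As a sanity check before each run, the GPU model reported by TensorFlow and the global float policy can be confirmed programmatically. This is a minimal sketch (not the benchmark script itself); device details require TF >= 2.3:

    import tensorflow as tf

    # Confirm which GPU the job actually landed on.
    for gpu in tf.config.list_physical_devices('GPU'):
        details = tf.config.experimental.get_device_details(gpu)
        print(gpu.name, details.get('device_name'), details.get('compute_capability'))

    # float32 is the Keras default, but assert it explicitly so no
    # mixed-precision policy sneaks in from the environment.
    tf.keras.backend.set_floatx('float32')
    assert tf.keras.backend.floatx() == 'float32'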

Goals

Certainly, there will be some batch size at which the V100 is superior. But the key questions are:

  • Do we actually reach that batch size with this application?
  • How far behind are the more affordable nodes?

Raw Benchmark Data

GPU             GPU_hardware                batchsize  mean_runtime  mean_gpu_mem  mean_cpu_mem  fraction_of_v100_runtime
T4              T4 8 TFLOPS + 16 GB                25        404.77        2655.0       4282.65                      1.39
T4              T4 8 TFLOPS + 16 GB                50        393.10        4703.0       4347.52                      1.94
P100            P100 9 TFLOPS + 16 GB              25        342.30        2591.0       3709.99                      1.17
P100            P100 9 TFLOPS + 16 GB              50        270.58        4639.0       3735.52                      1.34
P100            P100 9 TFLOPS + 16 GB             100        227.23        8735.0       3905.05                      1.44
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB           50        233.06        4765.0       5106.88                      1.15
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          100        171.66        8861.0       5383.15                      1.09
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          200           NaN           NaN           NaN                       NaN
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          400           NaN           NaN           NaN                       NaN
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          800           NaN           NaN           NaN                       NaN
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB         1600           NaN           NaN           NaN                       NaN
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB         25        425.71        2715.0       5747.67                      1.46
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB         50        282.31        4763.0       5614.73                      1.40
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB        100        208.90        8859.0       5878.01                      1.32
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB        200           NaN           NaN           NaN                       NaN
V100            V100 15 TFLOPS + 32 GB             25        291.72        2905.0       4228.56                      1.00
V100            V100 15 TFLOPS + 32 GB             50        202.17        4953.0       4158.98                      1.00
V100            V100 15 TFLOPS + 32 GB            100        158.20        9049.0       4279.63                      1.00
V100            V100 15 TFLOPS + 32 GB            200        130.07       17241.0       4709.92                      1.00
V100            V100 15 TFLOPS + 32 GB            400         92.70       31443.0       5188.68                      1.00
V100            V100 15 TFLOPS + 32 GB            800         74.02       31443.0       5135.33                      1.00
V100            V100 15 TFLOPS + 32 GB           1600           NaN           NaN           NaN                       NaN
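The last column is each configuration's mean runtime divided by the V100's mean runtime at the same batch size. A pandas sketch of that computation, assuming the table above is loaded into a DataFrame named df with these column names:

    import pandas as pd

    # V100 mean runtime at each batch size serves as the reference.
    v100_runtime = (df.loc[df['GPU'] == 'V100']
                      .set_index('batchsize')['mean_runtime'])

    # Values above 1.0 mean that GPU is slower than the V100 at that batch size.
    df['fraction_of_v100_runtime'] = df['mean_runtime'] / df['batchsize'].map(v100_runtime)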

Absolute Runtime vs batch size

Figure 1: Absolute runtime to traverse the full training set (50,000 examples). We report the average runtime over 5 complete epochs, discarding the first epoch since it always carries startup overhead. Missing data indicates an out-of-memory error at that batch size. Reported TFLOPS are for float32 operations.
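The timing follows the usual pattern of discarding a warm-up epoch (graph tracing, data pipeline startup) and averaging the rest. A minimal sketch, where train_one_epoch is a placeholder for the actual SSL-VAE training loop and the one-warm-up-plus-five-measured split is an assumption about the exact epoch bookkeeping:

    import time

    n_measured_epochs = 5
    epoch_times = []
    for epoch in range(1 + n_measured_epochs):
        start = time.perf_counter()
        train_one_epoch()   # placeholder for one full pass over the training set
        epoch_times.append(time.perf_counter() - start)

    # Discard the first (warm-up) epoch and average the remaining ones.
    mean_runtime = sum(epoch_times[1:]) / n_measured_epochs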

Relative Runtime vs batch size (reference = V100)

Figure 2: Relative runtime on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. A y-axis value below 1.0 would indicate that hardware is faster than the V100 at that batch size. We report the average runtime over 5 complete epochs, discarding the first epoch since it always carries startup overhead. Missing data indicates an out-of-memory error at that batch size. Reported TFLOPS are for float32 operations.

GPU memory usage vs batch size

Figure 3: GPU memory consumption on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Averaged over 5 epochs (we observed very little fluctuation after the first epoch). Missing data indicates an out-of-memory error at that batch size. Trained using gradient descent in TensorFlow with GPU memory growth enabled.
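Memory growth tells TensorFlow to allocate GPU memory on demand instead of claiming the whole card at startup, so the reported numbers track what the model actually uses. A minimal sketch of that setup; the peak-memory query via tf.config.experimental.get_memory_info needs TF 2.5+ and is shown as one plausible way to collect these numbers, not necessarily the exact instrumentation used here:

    import tensorflow as tf

    # Must run before any op touches the GPU; otherwise TF pre-allocates
    # nearly all of the card's memory up front.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

    # ... build the model and train for an epoch ...

    # Peak GPU memory (bytes) allocated by TF so far, converted to MB.
    peak_mb = tf.config.experimental.get_memory_info('GPU:0')['peak'] / 1e6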

CPU memory usage vs batch size

Figure 4: CPU memory consumption on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Averaged over 5 epochs (we observed very little fluctuation after the first epoch). Missing data indicates an out-of-memory error at that batch size. Trained using gradient descent in TensorFlow with GPU memory growth enabled.
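Host-side (CPU) memory is per training process; one common way to sample it is the resident set size via psutil, shown here as an illustrative sketch rather than the exact instrumentation used:

    import psutil

    # Resident set size of the current training process, converted to MB.
    cpu_mem_mb = psutil.Process().memory_info().rss / 1e6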