Our goal is a head-to-head comparison of the speed and memory consumption of different single-GPU hardware configurations when training a large probabilistic model (a semi-supervised VAE with a Wide ResNet 28 decoder) on the benchmark SVHN dataset.
We tested a range of GPU hardware available on the Tufts HPC cluster, on other clusters Mike has access to, and on a recent standalone Exxact workstation.
Specs for each configuration are:
GPU | TFLOPS (float32) | GPU memory | CPU | Location
T4 | 8 | 16 GB | Intel Xeon Gold 6248 @ 2.50GHz | Tufts HPC cluster
P100 | 9 | 16 GB | Intel Xeon E5-2695 v4 @ 2.10GHz | Tufts HPC cluster
RTX2080Ti | 13 | 11 GB | Intel i9-9940X @ 3.30GHz | Exxact node
RTX2080Tiv2 | 13 | 11 GB | Intel Xeon Bronze 3106 @ 1.70GHz | External cluster
V100 | 15 | 32 GB | Intel Xeon E5-2695 v4 @ 2.10GHz | Tufts HPC cluster
The list is roughly ordered from slowest to fastest GPU (by peak float32 TFLOPS).
All computations will be done using float32 (single-precision).
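For reference, single precision can be pinned explicitly in TensorFlow 2.x; a minimal sketch (the actual training scripts are not shown here, and this setting is already the TensorFlow default, so the call simply guards against an inherited mixed-precision policy):

```python
import tensorflow as tf

# Force all Keras layers to build float32 variables and compute in float32,
# matching the precision used for the reported TFLOPS figures.
tf.keras.mixed_precision.set_global_policy("float32")
```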
Certainly, there will be some batch size at which the V100 is superior. But the key questions are:
Figure 1: Absolute runtime to traverse the full training set (50,000 examples). We report the average runtime over 5 complete epochs, discarding the first epoch since it always incurs startup overhead. Missing data indicates an out-of-memory error at that batch size. Reported TFLOPS are for float32 operations.
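The timing protocol (average over 5 measured epochs, first epoch discarded) can be reproduced with a simple harness like the following; this is a sketch, where `run_one_epoch` is a hypothetical stand-in for the actual SSL VAE training loop:

```python
import time

def timed_epochs(run_one_epoch, n_epochs=6):
    """Time n_epochs calls to run_one_epoch and average all but the first.

    The first epoch is discarded because it includes one-time overhead
    (graph tracing, memory-allocator warm-up, data-pipeline startup).
    """
    durations = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_one_epoch()
        durations.append(time.perf_counter() - start)
    # Average over the n_epochs - 1 measured epochs, skipping epoch 0.
    return sum(durations[1:]) / len(durations[1:])
```

With `n_epochs=6`, this yields the mean over 5 complete epochs as reported in Figures 1 and 2.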
Figure 2: Relative runtime comparison on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Any y-axis value below 1.0 indicates that configuration is faster than the V100 at that batch size. We report the average runtime over 5 complete epochs, discarding the first epoch since it always incurs startup overhead. Missing data indicates an out-of-memory error at that batch size. Reported TFLOPS are for float32 operations.
Figure 3: GPU memory consumption on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Averaged over 5 epochs (we observed very little fluctuation after 1 epoch). Missing data indicates an out-of-memory error at that batch size. Trained using gradient descent in TensorFlow with GPU memory growth enabled.
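Memory growth matters here because, by default, TensorFlow reserves nearly all GPU memory at startup, which would make per-batch-size memory measurements uninformative. A minimal sketch of enabling it in TensorFlow 2.x:

```python
import tensorflow as tf

# With memory growth enabled, TensorFlow allocates GPU memory on demand
# rather than reserving the whole device at startup, so peak usage
# reflects what the model actually needs at a given batch size.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

This must run before any GPU ops execute, since physical devices cannot be reconfigured after initialization.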
Figure 4: CPU memory consumption on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Averaged over 5 epochs (we observed very little fluctuation after 1 epoch). Missing data indicates an out-of-memory error at that batch size. Trained using gradient descent in TensorFlow with GPU memory growth enabled.
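CPU-side memory can be read from the training process itself; one standard-library option is the `resource` module (a sketch, assuming a Unix-like host; the exact logging used in our runs is not shown):

```python
import resource
import sys

def peak_cpu_memory_mb():
    """Return this process's peak resident set size in megabytes."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return ru_maxrss / (1024 * 1024)
    return ru_maxrss / 1024
```

Sampling this once per epoch is enough, given how little the footprint fluctuates after the first epoch.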