GPU Benchmark: Training a semi-supervised VAE on the SVHN dataset

Our goal is a head-to-head comparison of speed and memory consumption across different single-GPU hardware configurations when training a large probabilistic model (a semi-supervised VAE with a Wide ResNet 28 decoder) on the benchmark SVHN dataset.

Configurations tested

We tested a range of GPU hardware available on the Tufts HPC cluster, on other clusters Mike has access to, and on a recent standalone Exxact workstation.

Specs for each configuration are:

GPU          TFLOPS (fp32)  GPU memory  CPU                                Location
T4           8              16 GB       Intel Xeon Gold 6248 @ 2.50GHz     Tufts HPC cluster
P100         9              16 GB       Intel Xeon E5-2695 v4 @ 2.10GHz    Tufts HPC cluster
RTX2080Ti    13             11 GB       Intel i9-9940X @ 3.30GHz           Exxact node
RTX2080Tiv2  13             11 GB       Intel Xeon Bronze 3106 @ 1.70GHz   External cluster
V100         15             32 GB       Intel Xeon E5-2695 v4 @ 2.10GHz    Tufts HPC cluster

The configurations are ordered roughly from weakest to fastest GPU (in terms of peak float32 TFLOPS).

All computations will be done using float32 (single-precision).
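As a sanity check before each run, the GPU model reported by TensorFlow and the global float policy can be confirmed programmatically. This is a minimal sketch (not the benchmark script itself); device details require TF >= 2.3:

    import tensorflow as tf

    # Confirm which GPU the job actually landed on.
    for gpu in tf.config.list_physical_devices('GPU'):
        details = tf.config.experimental.get_device_details(gpu)
        print(gpu.name, details.get('device_name'), details.get('compute_capability'))

    # float32 is the Keras default, but assert it explicitly so no
    # mixed-precision policy sneaks in from the environment.
    tf.keras.backend.set_floatx('float32')
    assert tf.keras.backend.floatx() == 'float32'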

Goals

Certainly, there will be some batch size at which the V100 is superior. But the key questions are:

  • Do we actually reach that batch size with this application?
  • How far behind are the more affordable nodes?

Raw Benchmark Data

GPU             GPU_hardware                batchsize  mean_runtime  mean_gpu_mem  mean_cpu_mem  fraction_of_v100_runtime
T4              T4 8 TFLOPS + 16 GB                25        404.77        2655.0       4282.65                      1.39
T4              T4 8 TFLOPS + 16 GB                50        393.10        4703.0       4347.52                      1.94
P100            P100 9 TFLOPS + 16 GB              25        342.30        2591.0       3709.99                      1.17
P100            P100 9 TFLOPS + 16 GB              50        270.58        4639.0       3735.52                      1.34
P100            P100 9 TFLOPS + 16 GB             100        227.23        8735.0       3905.05                      1.44
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB           50        233.06        4765.0       5106.88                      1.15
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          100        171.66        8861.0       5383.15                      1.09
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          200           NaN           NaN           NaN                       NaN
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          400           NaN           NaN           NaN                       NaN
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB          800           NaN           NaN           NaN                       NaN
RTX2080Ti       2080Ti 13 TFLOPS + 11 GB         1600           NaN           NaN           NaN                       NaN
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB         25        425.71        2715.0       5747.67                      1.46
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB         50        282.31        4763.0       5614.73                      1.40
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB        100        208.90        8859.0       5878.01                      1.32
BrownRTX2080Ti  2080Tiv2 13 TFLOPS + 11 GB        200           NaN           NaN           NaN                       NaN
V100            V100 15 TFLOPS + 32 GB             25        291.72        2905.0       4228.56                      1.00
V100            V100 15 TFLOPS + 32 GB             50        202.17        4953.0       4158.98                      1.00
V100            V100 15 TFLOPS + 32 GB            100        158.20        9049.0       4279.63                      1.00
V100            V100 15 TFLOPS + 32 GB            200        130.07       17241.0       4709.92                      1.00
V100            V100 15 TFLOPS + 32 GB            400         92.70       31443.0       5188.68                      1.00
V100            V100 15 TFLOPS + 32 GB            800         74.02       31443.0       5135.33                      1.00
V100            V100 15 TFLOPS + 32 GB           1600           NaN           NaN           NaN                       NaN
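The last column is each configuration's mean runtime divided by the V100's mean runtime at the same batch size. A pandas sketch of that computation, assuming the table above is loaded into a DataFrame named df with these column names:

    import pandas as pd

    # V100 mean runtime at each batch size serves as the reference.
    v100_runtime = (df.loc[df['GPU'] == 'V100']
                      .set_index('batchsize')['mean_runtime'])

    # Values above 1.0 mean that GPU is slower than the V100 at that batch size.
    df['fraction_of_v100_runtime'] = df['mean_runtime'] / df['batchsize'].map(v100_runtime)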

Absolute Runtime vs batch size

Figure 1: Absolute runtime to traverse the full training set (50,000 examples). We report the average runtime over 5 complete epochs, discarding the first epoch since it always carries startup overhead. Missing data indicates an out-of-memory error at that batch size. Reported TFLOPS are for float32 operations.
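The timing follows the usual pattern of discarding a warm-up epoch (graph tracing, data pipeline startup) and averaging the rest. A minimal sketch, where train_one_epoch is a placeholder for the actual SSL-VAE training loop and the one-warm-up-plus-five-measured split is an assumption about the exact epoch bookkeeping:

    import time

    n_measured_epochs = 5
    epoch_times = []
    for epoch in range(1 + n_measured_epochs):
        start = time.perf_counter()
        train_one_epoch()   # placeholder for one full pass over the training set
        epoch_times.append(time.perf_counter() - start)

    # Discard the first (warm-up) epoch and average the remaining ones.
    mean_runtime = sum(epoch_times[1:]) / n_measured_epochs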

Relative Runtime vs batch size (reference = V100)

Figure 2: Relative runtime on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. A y-axis value below 1.0 would indicate that hardware is faster than the V100 at that batch size. We report the average runtime over 5 complete epochs, discarding the first epoch since it always carries startup overhead. Missing data indicates an out-of-memory error at that batch size. Reported TFLOPS are for float32 operations.

GPU memory usage vs batch size

Figure 3: GPU memory consumption on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Averaged over 5 epochs (we observed very little fluctuation after the first epoch). Missing data indicates an out-of-memory error at that batch size. Trained using gradient descent in TensorFlow with GPU memory growth enabled.
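Memory growth tells TensorFlow to allocate GPU memory on demand instead of claiming the whole card at startup, so the reported numbers track what the model actually uses. A minimal sketch of that setup; the peak-memory query via tf.config.experimental.get_memory_info needs TF 2.5+ and is shown as one plausible way to collect these numbers, not necessarily the exact instrumentation used here:

    import tensorflow as tf

    # Must run before any op touches the GPU; otherwise TF pre-allocates
    # nearly all of the card's memory up front.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

    # ... build the model and train for an epoch ...

    # Peak GPU memory (bytes) allocated by TF so far, converted to MB.
    peak_mb = tf.config.experimental.get_memory_info('GPU:0')['peak'] / 1e6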

CPU memory usage vs batch size

Figure 4: CPU memory consumption on SVHN using the SSL VAE with Wide ResNet 28 CNN architecture. Averaged over 5 epochs (we observed very little fluctuation after the first epoch). Missing data indicates an out-of-memory error at that batch size. Trained using gradient descent in TensorFlow with GPU memory growth enabled.
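Host-side (CPU) memory is per training process; one common way to sample it is the resident set size via psutil, shown here as an illustrative sketch rather than the exact instrumentation used:

    import psutil

    # Resident set size of the current training process, converted to MB.
    cpu_mem_mb = psutil.Process().memory_info().rss / 1e6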