NVIDIA Hopper GH100 GPU Unveiled: The World’s First & Fastest 4nm Data Center Chip, Up To 4000 TFLOPs Compute, HBM3 3 TB/s Memory
Hassan Mujtaba
Wccftech

NVIDIA has officially unveiled its next-generation data center powerhouse, the Hopper GH100 GPU, built on a brand new 4nm process node. The GPU is an absolute monster, packing 80 Billion transistors and offering the fastest AI & compute horsepower of any GPU on the market.
NVIDIA Hopper GH100 GPU Official: First 4nm & HBM3 Equipped Data Center Chip, 80 Billion Transistors, Fastest AI/Compute Product On The Planet With Up To 4000 TFLOPs of Horsepower
Based on the Hopper architecture, the GH100 GPU is an engineering marvel produced on the bleeding-edge TSMC 4nm process node. Just like the data center GPUs that came before it, the Hopper GH100 will be targeted at various workloads including Artificial Intelligence (AI), Machine Learning (ML), Deep Neural Networks (DNN) and various HPC-focused compute workloads. The GPU is a one-stop solution for all HPC requirements, and it's one monster of a chip going by its size and performance figures.
Coming to the specifications, the NVIDIA Hopper GH100 GPU features a massive 144 SM (Streaming Multiprocessor) layout spread across a total of 8 GPCs. Each GPC rocks a total of 9 TPCs, which are in turn composed of 2 SM units each. That works out to 18 SMs per GPC and 144 SMs on the complete 8 GPC configuration. Each SM is composed of up to 128 FP32 units, which should give us a total of 18,432 CUDA cores; the quick tally below runs through the math.
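As a quick sanity check, here is a minimal sketch of that core-count math using only the figures quoted above; the GA100 baseline of 128 SMs with 64 FP32 cores each is NVIDIA's published full-die Ampere configuration and is assumed here for the 2.25x comparison in the next paragraph.

```python
# GH100 core-count tally from the quoted GPC/TPC/SM figures (a sketch, not official data).
gpcs = 8                   # GPCs on the full GH100 die
tpcs_per_gpc = 9           # TPCs per GPC
sms_per_tpc = 2            # SMs per TPC
fp32_per_sm = 128          # FP32 (CUDA) cores per SM

sms_per_gpc = tpcs_per_gpc * sms_per_tpc        # 18 SMs per GPC
total_sms = gpcs * sms_per_gpc                  # 144 SMs in total
total_fp32 = total_sms * fp32_per_sm            # 18,432 CUDA cores

ga100_fp32 = 128 * 64                           # full GA100: 8,192 CUDA cores (assumed baseline)
print(total_sms, total_fp32, total_fp32 / ga100_fp32)   # 144 18432 2.25
```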
This is a 2.25x increase over the full GA100 GPU configuration. NVIDIA is also leveraging more FP64, FP16 & Tensor cores within its Hopper GPU, which would drive up performance immensely. And that's going to be a necessity to rival Intel's Ponte Vecchio, which is also expected to feature 1:1 FP64.
The cache is another area where NVIDIA has paid much attention, upping it to 48 MB in the Hopper GH100 GPU. That's a 20% increase over the 40 MB of cache featured on the Ampere GA100 GPU and 3x the cache of AMD's flagship Aldebaran MCM GPU, the MI250X (see the quick comparison below).
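A minimal sketch of those cache comparisons, assuming the MI250X carries 8 MB of L2 per Aldebaran die (16 MB across the two-die package), which is AMD's published figure rather than something stated in this article:

```python
# L2 cache comparison from the quoted figures; the MI250X total is an assumption (2 x 8 MB).
gh100_l2_mb = 48
ga100_l2_mb = 40
mi250x_l2_mb = 2 * 8

print(f"vs GA100:  +{(gh100_l2_mb / ga100_l2_mb - 1) * 100:.0f}%")  # +20%
print(f"vs MI250X: {gh100_l2_mb / mi250x_l2_mb:.0f}x")              # 3x
```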
Rounding up the performance figures, NVIDIA's GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 compute performance. These record-shattering figures decimate every HPC accelerator that came before it. For comparison, that's 3.3x faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is 3x faster than the A100 and 5.2x faster than the MI250X, which is literally bonkers.
For memory, the NVIDIA Hopper GH100 GPU is equipped with brand new HBM3 memory that operates across a 6144-bit bus interface and delivers up to 3 TB/s of bandwidth, a 50% increase over the A100's HBM2e memory subsystem. Each H100 accelerator will be equipped with 80 GB of memory, though we can expect a double-capacity configuration down the road, just like the A100 80 GB.
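For the curious, the quoted bus width and bandwidth imply a per-pin data rate of roughly 3.9 Gbps, and the 50% uplift works out against a 2 TB/s A100 80 GB HBM2e subsystem; the 2 TB/s baseline is an assumption from NVIDIA's published A100 specs (the table below lists the earlier 40 GB part at 1.6 TB/s). A rough sketch:

```python
# Implied HBM3 per-pin data rate and the A100 comparison (a back-of-the-envelope sketch).
bus_width_bits = 6144
bandwidth_tbps = 3.0                                   # TB/s, as quoted

bytes_per_beat = bus_width_bits / 8                    # 768 bytes moved per transfer
pin_rate_gbps = bandwidth_tbps * 1e12 / bytes_per_beat / 1e9
print(f"~{pin_rate_gbps:.1f} Gbps per pin")            # ~3.9 Gbps

a100_hbm2e_tbps = 2.0                                  # assumed A100 80 GB figure
print(f"+{(bandwidth_tbps / a100_hbm2e_tbps - 1) * 100:.0f}% vs A100")  # +50%
```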
The GPU also features PCIe Gen 5 compliance with up to 128 GB/s of transfer rates and an NVLINK interface that provides 900 GB/s of GPU-to-GPU interconnect bandwidth. The whole Hopper H100 chip offers an insane 4.9 TB/s of external bandwidth. All of this monster performance comes in a 700W (SXM) package. The PCIe variants will be equipped with the latest PCIe Gen 5 power connectors, allowing for up to 600W of power.
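The 128 GB/s PCIe figure is just standard Gen 5 x16 arithmetic counted bidirectionally; a minimal sketch (with 128b/130b encoding overhead included, which is why the usable number lands slightly under the marketing figure):

```python
# PCIe Gen 5 x16 link bandwidth (standard PCIe arithmetic, not an NVIDIA-specific spec).
lanes = 16
transfer_rate_gt = 32                   # GT/s per lane at Gen 5
encoding = 128 / 130                    # 128b/130b line-coding efficiency

per_direction_gbs = lanes * transfer_rate_gt * encoding / 8   # ~63 GB/s each way
print(f"{per_direction_gbs:.0f} GB/s per direction, "
      f"~{2 * per_direction_gbs:.0f} GB/s bidirectional")     # ~63 / ~126 GB/s
```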
NVIDIA Hopper GH100 'Preliminary Specs':
| NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (SXM2) | NVIDIA A100 (SXM4) | NVIDIA H100 (SXM4?) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GA100 (Ampere) | GH100 (Hopper) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 7nm | 4nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 54.2 Billion | 80 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 826 mm2 | ~1000 mm2? |
| SMs | 15 | 24 | 56 | 56 | 80 | 108 | 134 (Per Module) |
| TPCs | 15 | 24 | 28 | 28 | 40 | 54 | TBD |
| FP32 CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64? |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32? |
| FP32 CUDA Cores | 2880 | 3072 | 3584 | 3584 | 5120 | 6912 | 8576 (Per Module) / 17152 (Complete) |
| FP64 CUDA Cores | 960 | 96 | 1792 | 1792 | 2560 | 3456 | 4288 (Per Module)? / 8576 (Complete)? |
| Tensor Cores | N/A | N/A | N/A | N/A | 640 | 432 | TBD |
| Texture Units | 240 | 192 | 224 | 224 | 320 | 432 | TBD |
| Boost Clock | 875 MHz | 1114 MHz | 1329 MHz | 1480 MHz | 1530 MHz | 1410 MHz | ~1400 MHz |
| TOPs (DNN/AI) | N/A | N/A | N/A | N/A | 125 TOPs | 1248 TOPs / 2496 TOPs with Sparsity | TBD |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 30.4 TFLOPs | 312 TFLOPs / 624 TFLOPs with Sparsity | 779 TFLOPs (Per Module)? / 1558 TFLOPs with Sparsity (Per Module)? |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 15.7 TFLOPs | 19.4 TFLOPs / 156 TFLOPs with Sparsity | 24.2 TFLOPs (Per Module)? / 193.6 TFLOPs with Sparsity? |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.80 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 24.2 TFLOPs (Per Module)? (12.1 TFLOPs standard)? |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 6144-bit HBM2e | 6144-bit HBM3 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 1.6 TB/s | Up To 100 GB HBM3 @ 3.0 TB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 40960 KB | 49152 KB |
| TDP | 235W | 250W | 250W | 300W | 300W | 400W | 700W |