NVIDIA Hopper GH100 GPU Unveiled: The World’s First & Fastest 4nm Data Center Chip, Up To 4000 TFLOPs Compute, HBM3 3 TB/s Memory

Hassan Mujtaba

Wccftech


NVIDIA has officially unveiled its next-generation data center powerhouse, the Hopper GH100 GPU, built on a brand new 4nm process node. The GPU is an absolute monster, packing 80 billion transistors and offering the fastest AI & compute horsepower of any GPU on the market.

NVIDIA Hopper GH100 GPU Official: First 4nm & HBM3 Equipped Data Center Chip, 80 Billion Transistors, Fastest AI/Compute Product On The Planet With Up To 4000 TFLOPs of Horsepower

Based on the Hopper architecture, the GH100 GPU is an engineering marvel produced on the bleeding-edge TSMC 4nm process node. Just like the data center GPUs that came before it, the Hopper GH100 will be targeted at various workloads including Artificial Intelligence (AI), Machine Learning (ML), Deep Neural Networks (DNN) and various HPC-focused compute workloads. The GPU is a one-stop solution for all HPC requirements, and it's one monster of a chip judging by its size and performance figures.

Coming to the specifications, the NVIDIA Hopper GH100 GPU features a massive 144 SM (Streaming Multiprocessor) layout spread across a total of 8 GPCs. Each GPC packs 9 TPCs, which are in turn composed of 2 SM units each. That works out to 18 SMs per GPC and 144 SMs in the full 8-GPC configuration. Each SM houses up to 128 FP32 units, which should give us a total of 18,432 CUDA cores.

This is a 2.25x increase over the full GA100 GPU configuration. NVIDIA is also leveraging more FP64, FP16 & Tensor cores within its Hopper GPU, which should drive up performance immensely. And that's going to be a necessity to rival Intel's Ponte Vecchio, which is also expected to feature 1:1 FP64.
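
Here's a minimal sanity check of that shader math, assuming the preliminary figures above (the GA100 reference point of 128 SMs with 64 FP32 cores each is NVIDIA's published full-die configuration):

```python
# Preliminary GH100 shader math from the figures above (unconfirmed).
GPCS = 8
TPCS_PER_GPC = 9
SMS_PER_TPC = 2
FP32_PER_SM = 128  # up from 64 on GA100

sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC       # 8 * 9 * 2 = 144
fp32_cores = sms * FP32_PER_SM                # 144 * 128 = 18,432

# Full GA100 for reference: 128 SMs x 64 FP32 cores each.
ga100_cores = 128 * 64                        # 8,192
print(fp32_cores, fp32_cores / ga100_cores)   # 18432 2.25
```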

The cache is another area where NVIDIA has paid much attention, upping it to 48 MB in the Hopper GH100 GPU. That's a 20% increase over the 40 MB cache featured on the Ampere GA100 GPU and 3x the size of AMD's flagship Aldebaran MCM GPU, the MI250X.
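
The two ratios check out; note that the MI250X figure (8 MB of L2 per die on the dual-die MCM) is AMD's public spec, added here for context rather than taken from NVIDIA's announcement:

```python
# L2 cache comparison. The MI250X value (2 x 8 MB) is AMD's public
# spec for the dual-die MCM, not a figure from NVIDIA's announcement.
gh100_l2_mb = 48
ga100_l2_mb = 40
mi250x_l2_mb = 2 * 8

print(round(gh100_l2_mb / ga100_l2_mb - 1, 2))  # 0.2 -> the quoted 20% uplift
print(gh100_l2_mb / mi250x_l2_mb)               # 3.0 -> 3x AMD's flagship
```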

Rounding out the performance figures, NVIDIA's GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 compute performance. These record-shattering figures decimate all other HPC accelerators that came before it. For comparison, this is 3.3x faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is 3x faster than the A100 and 5.2x faster than the MI250X, which is literally bonkers.
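
A quick back-of-the-envelope check shows where those ratios come from. The MI250X peaks used here (47.9 TFLOPs FP64 vector, 383 TFLOPs FP16) are from AMD's public spec sheet, not from NVIDIA's announcement, and the marketing percentages are rounded, so the computed ratios land close to, rather than exactly on, the quoted figures:

```python
# Rough speed-up check using the peak numbers quoted above.
# A100 values are the tensor/sparsity peaks from NVIDIA's specs;
# MI250X values are AMD's public specs (an assumption here).
h100 = {"FP64": 60.0, "FP16": 2000.0}
a100 = {"FP64": 19.5, "FP16": 624.0}
mi250x = {"FP64": 47.9, "FP16": 383.0}

for p in ("FP64", "FP16"):
    print(p, round(h100[p] / a100[p], 2), round(h100[p] / mi250x[p], 2))
# FP64: ~3.1x the A100, ~1.25x the MI250X (vs the quoted 3.3x and 28%)
# FP16: ~3.2x the A100, ~5.2x the MI250X
```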

For memory, the NVIDIA Hopper GH100 GPU is equipped with brand new HBM3 memory that operates across a 6144-bit bus interface and delivers up to 3 TB/s of bandwidth, a 50% increase over the A100's HBM2e memory subsystem. Each H100 accelerator will be equipped with 80 GB of memory, though we can expect a doubled-capacity configuration down the line, much like the A100 80 GB.
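
Working backwards from those two preliminary numbers gives a feel for the per-pin data rate NVIDIA would need from its HBM3 stacks:

```python
# Peak DRAM bandwidth = bus width (bits) x per-pin data rate / 8.
# Solving for the pin rate implied by the quoted preliminary figures:
bus_bits = 6144
bandwidth_gb_s = 3000  # 3 TB/s expressed in GB/s

pin_rate_gbps = bandwidth_gb_s * 8 / bus_bits
print(round(pin_rate_gbps, 2))  # ~3.91 Gbps per pin, plausible for early HBM3
```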

The GPU also features PCIe Gen 5 compliance with up to 128 GB/s transfer rates and an NVLINK interface that provides 900 GB/s of GPU-to-GPU interconnect bandwidth. In total, the Hopper H100 chip offers an insane 4.9 TB/s of external bandwidth. All of this monster performance comes in a 700W (SXM) package. The PCIe variants will be equipped with the latest PCIe Gen 5 power connectors, allowing for up to 600W of power.
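
For those wondering where the 128 GB/s figure comes from, it's a PCIe Gen 5 x16 link counted in both directions:

```python
# PCIe Gen 5 runs 32 GT/s per lane with 128b/130b encoding, and the
# marketing figure sums both transfer directions of an x16 link.
gt_per_lane = 32          # GT/s
encoding = 128 / 130      # usable bits per transferred bit
lanes = 16

per_direction = gt_per_lane * encoding / 8 * lanes  # ~63 GB/s one way
print(round(per_direction * 2))  # ~126 GB/s, marketed as "128 GB/s"
```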

NVIDIA Hopper GH100 'Preliminary Specs':

| NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (SXM2) | NVIDIA A100 (SXM4) | NVIDIA H100 (SXM4?) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GA100 (Ampere) | GH100 (Hopper) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 7nm | 4nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 54.2 Billion | 80 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 826 mm2 | ~1000 mm2? |
| SMs | 15 | 24 | 56 | 56 | 80 | 108 | 134 (Per Module) |
| TPCs | 15 | 24 | 28 | 28 | 40 | 54 | TBD |
| FP32 CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64? |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32? |
| FP32 CUDA Cores | 2880 | 3072 | 3584 | 3584 | 5120 | 6912 | 8576 (Per Module) / 17152 (Complete) |
| FP64 CUDA Cores | 960 | 96 | 1792 | 1792 | 2560 | 3456 | 4288 (Per Module)? / 8576 (Complete)? |
| Tensor Cores | N/A | N/A | N/A | N/A | 640 | 432 | TBD |
| Texture Units | 240 | 192 | 224 | 224 | 320 | 432 | TBD |
| Boost Clock | 875 MHz | 1114 MHz | 1329 MHz | 1480 MHz | 1530 MHz | 1410 MHz | ~1400 MHz |
| TOPs (DNN/AI) | N/A | N/A | N/A | N/A | 125 TOPs | 1248 TOPs (2496 TOPs with Sparsity) | TBD |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 30.4 TFLOPs | 312 TFLOPs (624 TFLOPs with Sparsity) | 779 TFLOPs (Per Module)? (1558 TFLOPs with Sparsity)? |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 15.7 TFLOPs | 19.4 TFLOPs (156 TFLOPs with Sparsity) | 24.2 TFLOPs (Per Module)? (193.6 TFLOPs with Sparsity)? |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.80 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 24.2 TFLOPs (Per Module)? (12.1 TFLOPs standard)? |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 6144-bit HBM2e | 6144-bit HBM3 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2e @ 2.0 TB/s | Up To 100 GB HBM3 @ 3.0 TB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 40960 KB | 49152 KB |
| TDP | 235W | 250W | 250W | 300W | 300W | 400W | 700W |
