Nvidia GPU Architecture Evolution

Nvidia Hardware Terminology

See this page: Nvidia Hardware Terminology

Timeline

  • 2016 -- Pascal
  • 2017 -- Volta
  • 2018 -- Turing
  • 2020 -- Ampere
  • 2022 -- Hopper, Ada Lovelace
  • 2024 -- Blackwell
  • 2025+ -- Rubin

Pascal

Introduces HBM2, NVLink 1.0 (160 GB/s), and a unified memory architecture.

Representative products:

  • GTX 1060, 1070, and 1080 Ti
  • Tesla P100
  • Quadro P6000

These cards have no Tensor Cores, so they perform poorly in AI training compared with later generations.

Volta

Introduces Tensor Cores 1.0 and NVLink 2.0 (300 GB/s).

Representative products:

  • Tesla V100
    • 16 GB/32 GB HBM2
    • FP16 Tensor Core performance: 125 TFLOPS

Turing

Introduces Tensor Cores 2.0 (adds INT8 and INT4), RT Cores, and DLSS.

Representative products:

  • RTX 2060, 2070, and 2080 Ti
  • GTX 1660 (Turing without Tensor Cores or RT Cores)
  • Tesla T4
  • Quadro RTX 6000, 8000

Ampere

Introduces Tensor Cores 3.0 (adds TF32, BF16, and FP64), structured sparsity (2:4), MIG, and NVLink 3.0 (600 GB/s).

Representative products:
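
To make the 2:4 pattern concrete: in every group of four consecutive weights, at most two may be non-zero. The sketch below prunes a flat list by keeping the two largest-magnitude values in each group. The function name `prune_2_4` and the magnitude-based selection are illustrative assumptions, not NVIDIA's tooling.

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude values in every group of four.

    Illustrative only: real pruning tools operate on tensors and usually
    fine-tune afterwards. Assumes len(weights) is a multiple of 4.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


example = prune_2_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1])
print(example)
```

Because exactly half of each group is zero, the hardware can skip those positions, which is where the "Sparse" throughput figures in spec sheets come from.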

  • A100 (40 GB/80 GB HBM2e)
    • CUDA cores: 6912
    • Tensor Cores: 432 (3.0)
    • NVLink 3.0
    • Memory bandwidth: 2039 GB/s (80 GB model)
    • 312 TFLOPS (FP16)
    • 156 TFLOPS (TF32)
    • 624 TOPS (INT8 dense; 1248 TOPS with sparsity)
  • RTX 3060, 3070, 3080
  • RTX 3090
    • CUDA cores: 10496
    • Tensor Cores: 328
    • Memory: 24 GB GDDR6X
    • 285 TFLOPS (FP16 Tensor, with sparsity)
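
One way to read the A100 compute and bandwidth figures together is their ratio, the roofline "ridge point": a kernel must perform at least this many FLOPs per byte of memory traffic to be limited by compute rather than by bandwidth. A minimal sketch using the A100 numbers listed above; `ridge_point` is a hypothetical helper name.

```python
def ridge_point(peak_tflops, bandwidth_gb_s):
    """Return the FLOPs-per-byte ratio at which peak compute and bandwidth balance."""
    flops_per_s = peak_tflops * 1e12
    bytes_per_s = bandwidth_gb_s * 1e9
    return flops_per_s / bytes_per_s


# A100 figures from the list above: 312 TFLOPS FP16, 156 TFLOPS TF32, 2039 GB/s.
a100_fp16 = ridge_point(312, 2039)
a100_tf32 = ridge_point(156, 2039)
print(f"A100 FP16 ridge point: {a100_fp16:.1f} FLOPs/byte")
print(f"A100 TF32 ridge point: {a100_tf32:.1f} FLOPs/byte")
```

With these inputs the FP16 ridge point is near 153 FLOPs per byte and the TF32 one about half that, so low-intensity kernels such as element-wise operations stay bandwidth-bound no matter how fast the Tensor Cores are.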

Hopper

Introduces Tensor Cores 4.0 (adds FP8), the Transformer Engine (switches dynamically between FP8 and FP16), NVLink 4.0 (900 GB/s), DPX instructions, the Tensor Memory Accelerator, and the Grace CPU.

Representative products:

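  • H100 (80 GB HBM3)
    • CUDA cores: 16896 (SXM5 product; the full GH100 die has 18432)
    • Tensor Cores: 528 (4.0; 640 on the full die)
    • NVLink 4.0
    • Memory bandwidth: 3350 GB/s
    • 1979 TFLOPS (FP8, Transformer Engine)
    • 990 TFLOPS (FP16)
    • 60 TFLOPS (FP64 Tensor Core at launch; the current datasheet lists 67)
    • 3958 TOPS (INT8 Sparse)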
  • H200 (141 GB HBM3e)
    • Memory bandwidth: 4800 GB/s
    • Other specifications match the H100
  • Grace CPU
    • Traditional bottleneck: CPU-GPU communication over PCIe
      • Intel/AMD CPU <- PCIe 5.0 x16 (128 GB/s) -> NVIDIA GPU
    • NVLink-C2C (chip-to-chip)
      • Grace CPU <- NVLink-C2C (900 GB/s) -> Hopper GPU
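
To put the CPU-GPU link comparison in concrete terms, the sketch below estimates the ideal time to move a payload over each link. It uses 128 GB/s for PCIe 5.0 x16 and 900 GB/s for NVLink-C2C (NVIDIA's published figure). The 80 GB payload and the helper name `transfer_ms` are assumptions for illustration, and latency and protocol overhead are ignored.

```python
def transfer_ms(size_gb, link_gb_s):
    """Ideal time in milliseconds to move size_gb over a link.

    Ignores latency, protocol overhead, and the fact that quoted link
    figures are usually bidirectional totals.
    """
    return size_gb / link_gb_s * 1000.0


PCIE5_X16_GB_S = 128   # PCIe 5.0 x16
NVLINK_C2C_GB_S = 900  # NVLink-C2C between Grace and Hopper

payload_gb = 80  # hypothetical payload, roughly one H100's worth of HBM
pcie_ms = transfer_ms(payload_gb, PCIE5_X16_GB_S)
c2c_ms = transfer_ms(payload_gb, NVLINK_C2C_GB_S)
print(f"PCIe 5.0 x16: {pcie_ms:.0f} ms")
print(f"NVLink-C2C:   {c2c_ms:.0f} ms")
print(f"Speedup:      {pcie_ms / c2c_ms:.1f}x")
```

Under these assumptions the transfer takes roughly 625 ms over PCIe and about 89 ms over NVLink-C2C, a ratio of about 7x.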

Ada Lovelace

Introduces Tensor Cores 4.0 (consumer grade), DLSS 3.0, RT Cores 3.0, and a GDDR6X upgrade.

Representative products:

  • RTX 4070, 4080
  • RTX 4090
    • CUDA cores: 16384
    • Tensor Cores: 512 (4.0)
    • RT Cores: 128
    • 24 GB GDDR6X memory
    • 1008 GB/s memory bandwidth
    • 661 TFLOPS (FP16 Sparse)
    • 83 TFLOPS (FP32)
    • 1321 TOPS (INT8 Sparse)
  • RTX 6000 Ada
  • L40S
    • CUDA cores: 18176
    • Tensor Cores: 568 (4.0)
    • RT Cores: 142
    • 48 GB GDDR6 memory (with ECC)
    • 864 GB/s memory bandwidth

Blackwell

Introduces Tensor Cores 5.0 (adds FP4 and FP6), Transformer Engine 2.0, NVLink 5.0 (1800 GB/s), a dual-die GPU design, micro-tensor scaling, and DLSS 4.0.

Representative products:

  • B100
  • B200 (192 GB HBM3e)
    • 2 GPU dies in one package
    • Tensor Cores: count not yet disclosed
    • Memory bandwidth: 8000 GB/s
    • NVLink 5.0
    • 18000 TFLOPS (FP4)
    • 9000 TFLOPS (FP8)
    • 4500 TFLOPS (FP16)
    • 72 PFLOPS (FP4, whole rack)
  • GB200 (Grace-Blackwell)
    • 1 Grace CPU (72-core Arm)
    • 2 B200 GPUs
    • NVLink-C2C (900 GB/s)
    • Total memory: 192 GB HBM3e per GPU + 480 GB LPDDR5X (CPU)
  • RTX 5090
    • CUDA cores: 21760
    • Tensor Cores: 680 (5.0)
    • RT Cores: 170 (4.0)
    • 32 GB GDDR7 memory
    • Memory bandwidth: 1792 GB/s
    • 3300 TOPS (FP4)
    • 900 TFLOPS (FP8)
    • DLSS 4.0
  • RTX 5070, 5080
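
The practical effect of the lower-precision formats added over the generations (FP16, FP8, FP4) is easiest to see as memory footprint. The sketch below estimates the memory needed for model weights alone at each precision and compares it with the 192 GB (B200) and 32 GB (RTX 5090) capacities listed above. The 70-billion-parameter model and the helper names are hypothetical, and activations, KV cache, and scaling-factor overhead are ignored.

```python
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_gb(params_billion, fmt):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * BITS_PER_PARAM[fmt]
    return total_bits / 8 / 1e9


def fits(params_billion, fmt, memory_gb):
    """True if the weights alone fit in memory_gb."""
    return weight_gb(params_billion, fmt) <= memory_gb


MODEL_B = 70  # hypothetical 70B-parameter model
for fmt in BITS_PER_PARAM:
    size = weight_gb(MODEL_B, fmt)
    print(f"{fmt}: {size:.0f} GB | "
          f"B200 (192 GB): {fits(MODEL_B, fmt, 192)} | "
          f"RTX 5090 (32 GB): {fits(MODEL_B, fmt, 32)}")
```

For this hypothetical model the weights take 140 GB at FP16, 70 GB at FP8, and 35 GB at FP4, so they fit on a single B200 at every precision but still slightly exceed the RTX 5090's 32 GB even at FP4.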