
Nvidia Hardware Terms

  • HBM: High Bandwidth Memory, a high-performance memory technology that places stacked DRAM dies right next to the GPU.
    • Traditional way: GPU -> long PCB traces -> GDDR memory chips distributed across the PCB
    • HBM way: GPU -> very short interposer connections -> HBM stacks sitting close to the GPU
  • NVLink: High-speed GPU interconnect technology used to solve communication bottlenecks in multi-GPU systems (see the peer-access sketch after this list).
    • Allows multiple GPUs to transfer data to each other directly, without going through the CPU or the PCIe bus
    • Bandwidth far exceeds PCIe
  • CUDA Core: General-purpose parallel processing unit that can perform a wide range of calculations.
  • Tensor Core: Unit designed specifically to accelerate the matrix operations at the heart of deep learning. Each Tensor Core instruction works on a small matrix tile at a specific supported precision (e.g. FP16, TF32, or FP64, depending on the GPU generation).
    E.g.: C = A * B, where A, B, and C are 4x4 matrices (see the operation-count sketch after this list)
    • CUDA Core matrix operations
      • 64 scalar multiplications
      • 48 scalar additions
      • 112 instructions in total
      • several clock cycles
    • Tensor Core matrix operations
      • 1 dedicated instruction
      • 1 clock cycle
  • TOPS: tera operations per second, usually quoted for integer throughput (e.g. INT8)
  • TFLOPS: tera floating-point operations per second, quoted per precision (e.g. FP16); see the peak-throughput sketch after this list
  • DLSS: Deep Learning Super Sampling. Uses AI to upscale low-resolution frames to a higher resolution; a neural network is trained to "fill in the details"
  • RT Core: Specifically designed to accelerate ray tracing calculations.
  • Structured Sparsity: Removing weights in a regular pattern (e.g. 2 out of every 4) so that the hardware can skip them efficiently (see the 2:4 pruning sketch after this list)
  • MIG: Multi-Instance GPU. Partitions one large GPU into multiple smaller GPU instances that run independently, so several users or jobs can share a single GPU at the same time, each with its own isolated instance.
  • Transformer Engine: Hardware acceleration engine (paired with a software library) optimized specifically for Transformer models (see the FP8 sketch after this list)
    • Automatically switches between FP8 and FP16 precision
    • Optimizes the self-attention operation
  • Tensor Memory Accelerator: Dedicated hardware unit that accelerates data movement between GPU memory levels (e.g. global memory and shared memory)
  • DPX: Dynamic programming instructions, which accelerate the inner loops of dynamic-programming algorithms such as sequence alignment (see the edit-distance sketch after this list)
  • ECC: Error-Correcting Code. Memory error detection and correction technology that automatically detects and repairs bit-flip errors in memory; when the ECC bits have to be stored in the same memory, enabling it reserves roughly 12.5% of capacity
  • Micro-Tensor Scaling: Fine-grained tensor quantization/scaling technique (relevant to FP8 training), needed because FP8 has a small dynamic range and is prone to overflow/underflow (see the per-block scaling sketch after this list)
    • Traditional scaling: the entire tensor shares one scaling factor
    • Per-block scaling: the tensor is divided into blocks, and each block is scaled independently
  • SXM and NVLink-C2C: SXM is Nvidia's socketed GPU module form factor; in SXM servers the GPU still talks to the Intel/AMD host CPU over PCIe, which is comparatively slow (a PCIe Gen 5 x16 link tops out at roughly 64 GB/s per direction). Nvidia's Grace CPU instead connects to the GPU over NVLink-C2C, a chip-to-chip link that delivers about 900 GB/s.
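
A few minimal sketches follow for some of the terms above. First, NVLink and peer-to-peer transfers: a quick check of whether each pair of GPUs in a machine can access each other's memory directly. It assumes PyTorch and at least two CUDA devices; note that peer access being available does not by itself tell you whether the link is NVLink or plain PCIe.

```python
import torch

# Check peer-to-peer (P2P) access between every pair of visible GPUs.
# P2P lets one GPU read/write another GPU's memory without staging the
# data through host (CPU) memory.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```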
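
Tensor Core vs. CUDA Core operation count: plain Python counting the scalar operations in a naive 4x4 matrix multiply, which reproduces the 64 multiplications and 48 additions (112 operations) quoted above; a Tensor Core does the equivalent work as a single matrix instruction.

```python
# Count the scalar operations in a naive 4x4 matrix multiply C = A * B.
N = 4
A = [[1.0] * N for _ in range(N)]
B = [[1.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]
muls = adds = 0
for i in range(N):
    for j in range(N):
        acc = A[i][0] * B[0][j]
        muls += 1
        for k in range(1, N):
            acc += A[i][k] * B[k][j]
            muls += 1
            adds += 1
        C[i][j] = acc
print(muls, adds)   # -> 64 multiplications, 48 additions (112 scalar operations)
```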
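
TOPS/TFLOPS: a back-of-the-envelope peak-throughput calculation. The core count and clock below are made-up illustrative numbers, not the specification of any particular GPU.

```python
# Peak throughput = cores x clock x FLOPs per core per clock.
cuda_cores = 10_000            # hypothetical core count
boost_clock_ghz = 1.8          # hypothetical boost clock (GHz)
flops_per_core_per_clock = 2   # one fused multiply-add counts as 2 FLOPs

peak_tflops = cuda_cores * boost_clock_ghz * flops_per_core_per_clock / 1000
print(f"{peak_tflops:.1f} TFLOPS")   # -> 36.0 TFLOPS
```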
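
Structured sparsity: a NumPy sketch of 2:4 pruning, the regular pattern recent Tensor Cores can exploit; in every contiguous group of 4 weights, the 2 largest magnitudes are kept and the other 2 are zeroed.

```python
import numpy as np

def prune_2_to_4(w):
    """Zero the 2 smallest-magnitude weights in every group of 4."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
w_sparse = prune_2_to_4(w)
print((w_sparse == 0).mean())   # -> 0.5 (exactly half of the weights removed)
```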
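
Transformer Engine: the hardware is driven through NVIDIA's transformer_engine Python library. The sketch below reflects my understanding of its FP8 autocast API; exact names and defaults may differ between versions, and it requires an FP8-capable GPU (Hopper or newer).

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Layers from transformer_engine run their matrix multiplies in FP8 inside
# the fp8_autocast context; per-tensor scaling factors are managed for you.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(16, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)   # -> torch.Size([16, 1024])
```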
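
DPX: the instructions accelerate the min/max-plus inner loops of dynamic-programming algorithms (sequence alignment, edit distance, and similar). The plain-Python edit distance below only illustrates the class of workload DPX targets, not the instructions themselves.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein dynamic program with a rolling 1-D table."""
    dp = list(range(len(b) + 1))            # dp[j] = distance(a[:0], b[:j])
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i              # prev holds dp[i-1][j-1]
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution (free if chars match)
            prev = cur
    return dp[-1]

print(edit_distance("kitten", "sitting"))   # -> 3
```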
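
Micro-tensor / per-block scaling: a NumPy sketch comparing one scale factor for the whole tensor against one scale factor per block. The fake_fp8 helper is a crude uniform quantizer standing in for a real FP8 cast, purely to show why a single scale struggles when the tensor contains outliers.

```python
import numpy as np

def fake_fp8(x, levels=256):
    """Toy stand-in for an FP8 cast: uniform quantization over [-1, 1]."""
    step = 2.0 / levels
    return np.clip(np.round(x / step) * step, -1.0, 1.0)

def quant_per_tensor(x):
    # Traditional scaling: one scale factor for the whole tensor.
    scale = np.abs(x).max()
    return fake_fp8(x / scale) * scale

def quant_per_block(x, block=8):
    # Micro-tensor scaling: one scale factor per block of `block` values.
    b = x.reshape(-1, block)
    scales = np.abs(b).max(axis=1, keepdims=True)
    return (fake_fp8(b / scales) * scales).reshape(x.shape)

# Small values plus one large outlier: a single per-tensor scale crushes the
# small values to zero, while per-block scaling preserves most of them.
x = np.random.randn(64).astype(np.float32) * 0.01
x[0] = 100.0
print(np.abs(quant_per_tensor(x) - x).mean())   # large error
print(np.abs(quant_per_block(x) - x).mean())    # much smaller error
```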