GPU Architecture and CUDA Programming

GPU Architecture

GPUs are massively parallel processors optimized for throughput.

Key Characteristics

Thousands of lightweight threads
SIMT (Single Instruction, Multiple Thread) execution model
Warps: groups of 32 threads executing in lockstep
High memory bandwidth (hundreds of GB/s)

Memory Hierarchy

Memory Type	Scope	Size	Speed
Registers	Per thread	Limited	Fastest
Shared Memory	Per thread block	~48-96 KB	~Register speed
L1/L2 Cache	Per SM / Device	KB-MB	Fast
Global Memory	All threads	GB	Slow (~600 cycles)
Constant Memory	All threads (read-only)	~64 KB	Cached, fast for uniform access
Texture Memory	All threads (read-only)	Cached	Optimized for spatial locality

CUDA Programming Model

Kernel Launch

kernel<<<numBlocks, threadsPerBlock>>>(args);

Thread Indexing

int i = blockIdx.x * blockDim.x + threadIdx.x;  // 1D
int row = blockIdx.y * blockDim.y + threadIdx.y;  // 2D
int col = blockIdx.x * blockDim.x + threadIdx.x;

Memory Operations

// Allocate device memory
cudaMalloc((void **)&d_A, size);

// Copy host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

// Copy device to host
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);

// Free
cudaFree(d_A);

Pinned (Page-Locked) Memory

cudaHostAlloc((void **)&h_A, size, cudaHostAllocDefault);
cudaFreeHost(h_A);

Tiling and Shared Memory Optimization

Tiled Matrix Multiplication

Each thread block computes an output tile. Input tiles are loaded into shared memory to reduce global memory accesses.

Reduction factor = \frac{O_T I L E_W I D T H^{2} \times M A S K_W I D T H^{2}}{(O_T I L E_W I D T H + M A S K_W I D T H - 1)^{2}}

Tiled Convolution

Load input tile with halo into shared memory
__syncthreads() barrier
Compute output using shared memory
Write result to global memory

Design Tradeoffs

Option 1: Thread block size = output tile size (some threads load extra input)
Option 2: Thread block size = input tile size (some threads idle during computation)

Larger shared memory usage → better effective memory bandwidth, but fewer thread blocks can be resident on each SM.

Key Optimization Principles

Maximize thread occupancy (more warps can hide memory latency)
Coalesce global memory accesses (consecutive threads access consecutive addresses)
Minimize host-device transfers (keep data on GPU as long as possible)
Use shared memory for data reuse within a block
Avoid thread divergence within warps (different execution paths serialize)
Use constant memory for read-only data uniformly accessed by all threads

GPU Architecture and CUDA Programming ​

GPU Architecture ​

Key Characteristics ​

Memory Hierarchy ​

CUDA Programming Model ​

Kernel Launch ​

Thread Indexing ​

Memory Operations ​

Pinned (Page-Locked) Memory ​

Tiling and Shared Memory Optimization ​

Tiled Matrix Multiplication ​

Tiled Convolution ​

Design Tradeoffs ​

Key Optimization Principles ​