About Tiling

Another way to understand tiled matrix multiplication is phase by phase. That is, to compute one output tile, we do the matrix multiplication of the two corresponding input-matrix tiles phase by phase, one pair of input tiles per phase, accumulating the partial results.

For a case like convolution, however, tiling is just tiling: we mainly tile the output matrix.

So for tiling in general, the most sensible mental model is still: tile the output matrix.
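
A minimal sketch of the phase-by-phase idea, assuming square TILE_WIDTH x TILE_WIDTH tiles, row-major matrices whose dimensions are multiples of TILE_WIDTH, and a block size of (TILE_WIDTH, TILE_WIDTH); the kernel name `matmul_tiled` is just illustrative:

```cuda
#define TILE_WIDTH 32

// C = A * B, all N x N, row-major.
// Each thread block computes one TILE_WIDTH x TILE_WIDTH tile of C,
// walking the K dimension one phase (one pair of input tiles) at a time.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;
    // One iteration of this loop == one "phase".
    for (int ph = 0; ph < N / TILE_WIDTH; ++ph) {
        // Load one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + ph * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(ph * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();

        // Partial dot product using only the tiles currently in shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE_WIDTH, TILE_WIDTH)` and `dim3 grid(N / TILE_WIDTH, N / TILE_WIDTH)`, i.e. one block per output tile.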

About memory coalescing

The key is to be clear about which data the "next" thread is processing, and where it loads that data from.

Then, for the thread x index, it is probably best to treat it as the innermost loop, sitting inside all of the for loops written in the kernel. If the thread x index is treated as the innermost loop iteration (of course this loop is never written in the code, so it should be considered an implicit iteration), and the input data is row-major, then memory coalescing is essentially guaranteed.

  • This means that even inside the innermost loop written in the code, there is effectively one more implicit iteration over the thread x index.
    • Within that implicit iteration, the data each thread needs is very likely adjacent, because every loop index other than the thread x index sits outside the thread-x-index iteration. (A small sketch contrasting coalesced and uncoalesced access follows this list.)
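
A small sketch of this, assuming a row-major N x N float matrix; the kernel names are made up for illustration. In the first kernel, consecutive threadIdx.x values read consecutive addresses within one row (coalesced); in the second, consecutive threadIdx.x values read addresses a whole row apart (strided, uncoalesced):

```cuda
// Coalesced: threadIdx.x indexes the fastest-varying (column) dimension,
// so threads 0..31 of a warp touch 32 consecutive floats in the same row.
__global__ void copy_coalesced(const float* in, float* out, int N) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < N) out[row * N + col] = in[row * N + col];
}

// Uncoalesced: threadIdx.x indexes the row dimension, so threads 0..31
// of a warp touch addresses that are N floats apart.
__global__ void copy_strided(const float* in, float* out, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y;
    if (row < N) out[row * N + col] = in[row * N + col];
}
```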

Constraints to keep in mind

Assume we use an A100 GPU.

  • One thread block can have 1024 threads at most
  • One SM can be scheduled to work on 2048 threads at most
  • A grid can have many thread blocks
    • x: up to 2^31 − 1 blocks
    • y: up to 65,535 blocks
    • z: up to 65,535 blocks
    • Basically, this is the level at which ‘compute jobs’ are supposed to be ‘scalable’ for scheduling.
  • Each warp runs 32 threads in a SIMD manner
  • Max number of registers per SM: 65536
    • To reach 2048 threads per SM (maximum occupancy), each thread can use no more than 65,536 / 2,048 = 32 registers (see the __launch_bounds__ sketch after this list)
  • Usage of global memory bandwidth vs. computation, ideal ratio: 12.5 OP/B
    • For every byte loaded from global memory, do at least 12.5 compute operations (a worked version of this number follows after this list)
      • This is only the threshold for reaching the A100’s 19,500 GFLOPS FP32 peak on the SMs’ CUDA cores.
      • For Tensor Cores, the roof is much higher: 156,000 GFLOPS (TF32)
  • Max amount of shared memory per SM: 164KB.
    • To reach 2048 threads per SM (maximum occupancy), each thread can use no more than 164KB / 2,048 = 82 bytes on average
    • [2025-08-09-Saturday] (from Gemini 2.5, in a discussion about CUTLASS) This limit also bounds per-thread-block shared memory usage: the task running in one thread block should not use more than 164KB of shared memory.
      • Actually, considering the need to hide data-loading latency (double buffering), one thread block should consider using even less shared memory.
        • Actually, of the 164KB shared-memory space we usually treat about 96KB as ‘available’, and if we want double buffering to hide data-loading latency, we should strive to use at most 96KB / 2 = 48KB of shared memory per thread block (see the double-buffering sketch after this list).
  • Constant memory is actually just a space in global memory, but it hints the runtime to aggressively cache its contents (see the constant-memory sketch after this list).
    • So the real constraint is not the size of constant memory, but the L1 and L2 caches.
      • Usually it is good to keep the constant-memory content under 64KB (this is also the hard limit CUDA places on __constant__ declarations).
      • L1: usually between 16KB and 64KB.
      • L2: usually a few hundred kilobytes to a few megabytes.
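
For the register constraint above, a hedged sketch: `__launch_bounds__` is the CUDA qualifier that tells the compiler the intended block size and (optionally) the minimum number of resident blocks per SM, so it can cap register usage accordingly; the kernel body here is just a placeholder.

```cuda
// 1024 threads per block, at least 2 blocks resident per SM => 2048 threads/SM,
// which on A100 (65,536 registers per SM) means <= 32 registers per thread.
__global__ void __launch_bounds__(1024, 2) scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}
```

The same cap can also be imposed globally at compile time with nvcc's `--maxrregcount=32`, at the cost of possible register spilling.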
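A worked version of the 12.5 OP/B number and what it implies for tiled matmul, as a sketch. The 1,555 GB/s figure is the A100 40GB HBM2 bandwidth; treat the exact numbers as assumptions that vary by SKU.

```cuda
// Host-side arithmetic only; nothing GPU-specific executes here.
#include <cstdio>

int main() {
    // A100 (40GB) ballpark peak numbers.
    const double peak_fp32_flops = 19500e9;   // 19,500 GFLOPS on the CUDA cores
    const double peak_bandwidth  = 1555e9;    // ~1,555 GB/s HBM2 bandwidth

    // To be compute-bound, a kernel needs at least this many FLOPs per byte loaded.
    printf("required intensity: %.1f FLOP/B\n",
           peak_fp32_flops / peak_bandwidth);  // ~12.5

    // Tiled matmul: per phase, each thread loads 2 floats (8 bytes) from global
    // memory and does 2 * TILE_WIDTH FLOPs, so intensity = TILE_WIDTH / 4.
    for (int tile = 16; tile <= 64; tile *= 2)
        printf("TILE_WIDTH=%d -> %.1f FLOP/B\n", tile, tile / 4.0);
    // 16 -> 4, 32 -> 8, 64 -> 16: tiling alone only crosses 12.5 FLOP/B around
    // an effective tile width of ~50, which in practice means thread coarsening
    // (larger per-thread output tiles), since a 64 x 64 block exceeds 1024 threads.
    return 0;
}
```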
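A sketch of the double-buffering idea from the shared-memory bullet above: two tile buffers per input in shared memory, so the loads for the next phase can be issued before the current tiles are consumed (budget: 2x the tile footprint, which should stay within the ~48KB-per-block target). This version uses plain loads plus __syncthreads() rather than cp.async, and reuses the tile/loop structure of the earlier tiled-matmul sketch.

```cuda
#define TILE_WIDTH 32

__global__ void matmul_double_buffered(const float* A, const float* B, float* C, int N) {
    // Two buffers per input => 4 * 32 * 32 * 4B = 16KB of shared memory per block.
    __shared__ float As[2][TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[2][TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int phases = N / TILE_WIDTH;

    // Preload phase 0 into buffer 0.
    As[0][threadIdx.y][threadIdx.x] = A[row * N + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    float acc = 0.0f;
    for (int ph = 0; ph < phases; ++ph) {
        int cur = ph % 2, nxt = (ph + 1) % 2;

        // Issue the loads for the next phase into the other buffer before
        // consuming the current one; no __syncthreads() in between, so the
        // loads can overlap with the compute below.
        if (ph + 1 < phases) {
            As[nxt][threadIdx.y][threadIdx.x] =
                A[row * N + (ph + 1) * TILE_WIDTH + threadIdx.x];
            Bs[nxt][threadIdx.y][threadIdx.x] =
                B[((ph + 1) * TILE_WIDTH + threadIdx.y) * N + col];
        }

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];

        // One barrier per phase: ensures the next buffer is fully written and
        // the current buffer is no longer being read before it gets overwritten.
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```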
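A minimal sketch of using constant memory, e.g. for a small convolution filter. `__constant__` and `cudaMemcpyToSymbol` are the real CUDA mechanisms; the kernel and sizes are illustrative.

```cuda
#include <cuda_runtime.h>

#define FILTER_RADIUS 3
#define FILTER_WIDTH (2 * FILTER_RADIUS + 1)

// Lives in global memory, but reads are served by the constant cache,
// which works best when all threads in a warp read the same element.
__constant__ float d_filter[FILTER_WIDTH];

__global__ void conv1d(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = -FILTER_RADIUS; k <= FILTER_RADIUS; ++k) {
        int j = i + k;
        if (j >= 0 && j < n)
            acc += in[j] * d_filter[k + FILTER_RADIUS];
    }
    out[i] = acc;
}

// Host side: copy the filter into the constant-memory symbol before launching:
//   cudaMemcpyToSymbol(d_filter, h_filter, FILTER_WIDTH * sizeof(float));
```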

NOTE

The following things are similar or closely related:

  • Shared Memory in CUDA
  • The underlying space used for ‘Local Memory’ in OpenCL
  • AMD’s Local Data Share (LDS)
  • Intel’s Shared Local Memory (SLM)