For completeness (and to give some actual Fiji info to this thread), I calculated the cost of spilling N registers to memory on Fiji. Assume a big shader that fills the whole GPU at full occupancy on all CUs: all threads start executing at roughly the same time and spill to memory at roughly the same time.
64 CUs * 40 waves/CU * 64 threads/wave * N registers/thread * 4 bytes/register = N * 640 kB.
As said earlier, spilling one register is covered by the L1 cache (640 kB / 64 CUs = 10 kB per CU), and spilling three (640 kB * 3 = 1920 kB) is covered by the L2. However, this trashes the L1 and L2 caches completely, so the GPU is certainly going to stall for a while, unless all the other data needed for the work is already in LDS.
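To sanity check these numbers, here is a minimal Python sketch of the same arithmetic. The occupancy figures are the ones used above; the 16 kB per-CU vector L1 and 2 MB L2 sizes are my assumptions for Fiji, so treat the cache comparison as a rough guide rather than a precise model.

```python
# Back-of-the-envelope spill footprint on Fiji, following the numbers above.
# Cache sizes are assumptions: 16 kB vector L1 per CU, 2 MB L2 for the chip.

CUS              = 64          # compute units on Fiji
WAVES_PER_CU     = 40          # full occupancy
THREADS_PER_WAVE = 64
BYTES_PER_REG    = 4

L1_PER_CU = 16 * 1024          # assumed vector L1 size per CU
L2_TOTAL  = 2 * 1024 * 1024    # assumed L2 size for the whole chip

def spill_bytes(n_regs):
    """Total bytes written if every thread spills n_regs registers."""
    return CUS * WAVES_PER_CU * THREADS_PER_WAVE * n_regs * BYTES_PER_REG

for n in (1, 2, 3, 4):
    total = spill_bytes(n)
    per_cu = total // CUS
    print(f"N={n}: {total // 1024} kB total, {per_cu // 1024} kB per CU "
          f"(fits L1: {per_cu <= L1_PER_CU}, fits L2: {total <= L2_TOTAL})")

# N=1 -> 640 kB total, 10 kB per CU (fits in the per-CU L1)
# N=3 -> 1920 kB total (still fits in the 2 MB L2, but evicts everything else)
```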
GCN also has 8 kB of scalar registers (SGPRs) per CU. This gives fast storage space for 51 extra registers per wave (one 32-bit value shared by all 64 threads of the wave). This is the best way to store data that is constant across the thread group. Unfortunately PC graphics APIs do not expose the scalar unit registers directly. The compiler can take advantage of SGPRs in some specific cases, for example when it knows for sure (at compile time) that all threads in the wave load data from the same address (a static constant buffer load, for example, or a buffer load indexed by SV_GroupId). This is a great way to reduce VGPR pressure.
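For reference, the 51-registers-per-wave figure falls out of the same kind of division; a quick sketch using the 8 kB SGPR file size quoted above (the hardware's own per-wave SGPR allocation limit is a separate constraint and not modelled here):

```python
# Raw SGPR storage per wave, from the figures above: 8 kB of SGPRs per CU,
# shared by 40 waves at full occupancy, 4 bytes per scalar register.

SGPR_BYTES_PER_CU = 8 * 1024
WAVES_PER_CU      = 40
BYTES_PER_SGPR    = 4

sgprs_per_wave = SGPR_BYTES_PER_CU // (WAVES_PER_CU * BYTES_PER_SGPR)
print(sgprs_per_wave)  # 51 -> each one holds a single 32-bit value for the whole wave
```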