(Slightly OT, continued from my above post)
I did some extra testing with the GCN ROP caches. Assuming the mobile versions have ROP caches of the same size as my Radeon 7970 (128 KB), rendering in 128x128 tiles looks like it could help these new mobile APUs a lot, since APUs are heavily BW limited.
Particle rendering has a huge backbuffer BW cost, especially when rendering HDR particles to a 4x16f backbuffer. Our system renders all particles using a single draw call (the particle index buffer is depth sorted, and we use premultiplied alpha to get both additive and alpha blended particles simultaneously). It is actually over 2x faster to brute force render a draw call containing 10k particles 60 times to 128x128 tiles (moving the scissor rectangle across a 1280x720 backbuffer) than to render it once (single draw call, full screen). And you can achieve these kinds of gains by spending 15 minutes (just a brute force hack). With a little bit of extra code, you can skip the particle quads (using a geometry shader) that do not land on the active 128x128 scissor area (and save most of the extra geometry cost). This is a good way to reduce the particle overdraw BW cost to zero: a 128x128 tile is rendered (alpha blended) completely inside the GPU ROP cache. This is an especially good technique for low BW APUs, but it helps even the Radeon 7970 GE (with its massive 288 GB/s BW).
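To make the tiling concrete, here is a rough CPU-side sketch (not our actual engine code) of how the 60 scissor rectangles for a 1280x720 backbuffer could be generated, plus the kind of overlap test a geometry shader could use to throw away quads outside the active tile. The names `Rect`, `makeTiles` and `quadTouchesTile` are made up for this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical rectangle type; a real renderer would use the scissor
// rectangle type of its graphics API instead.
struct Rect {
    int left, top, right, bottom;  // right and bottom are exclusive
};

// Split a width x height render target into tile x tile scissor rectangles.
// Tiles on the right/bottom edge are clamped to the target bounds.
std::vector<Rect> makeTiles(int width, int height, int tile) {
    std::vector<Rect> tiles;
    for (int y = 0; y < height; y += tile) {
        for (int x = 0; x < width; x += tile) {
            tiles.push_back({x, y,
                             std::min(x + tile, width),
                             std::min(y + tile, height)});
        }
    }
    return tiles;
}

// Overlap test between a particle quad's screen-space bounding box and the
// active tile. Quads that fail this test can be skipped for that tile.
bool quadTouchesTile(const Rect& quad, const Rect& tile) {
    return quad.left < tile.right && quad.right > tile.left &&
           quad.top < tile.bottom && quad.bottom > tile.top;
}
```

The brute force version is then just a loop over `makeTiles(1280, 720, 128)`: set the scissor rectangle to the tile and issue the same particle draw call again. 1280/128 gives 10 columns and 720/128 rounds up to 6 rows, which is where the 60 passes come from.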
With this technique, soft particles gain even more, since the depth texture reads for a 128x128 area fit in the GCN 512/768 KB L2 cache (and become BW free as well). Of course Kepler based chips should see similar gains (but I don't have one for testing).
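The tile size falls out of simple arithmetic, sketched below. The 8 bytes per pixel follows from the 4x16f format (four 16-bit channels). The 4 bytes per pixel for depth is my assumption of a 32-bit depth format, which isn't stated above. Whether a tile really stays fully resident also depends on cache details this calculation ignores.

```cpp
#include <cstddef>

constexpr std::size_t KiB = 1024;

// Bytes covered by one square tile at a given per-pixel size.
constexpr std::size_t tileBytes(std::size_t tileSize, std::size_t bytesPerPixel) {
    return tileSize * tileSize * bytesPerPixel;
}

// 4x16f backbuffer: four 16-bit float channels = 8 bytes per pixel.
constexpr std::size_t kColorTileBytes = tileBytes(128, 8);

// Assumption: 32-bit depth, 4 bytes per pixel (format not given in the post).
constexpr std::size_t kDepthTileBytes = tileBytes(128, 4);

// The color tile is exactly the size of the 128 KB ROP cache.
static_assert(kColorTileBytes == 128 * KiB, "color tile should be 128 KB");

// The depth tile is 64 KB, well under the smaller 512 KB L2 figure.
static_assert(kDepthTileBytes == 64 * KiB, "depth tile should be 64 KB");
static_assert(kDepthTileBytes < 512 * KiB, "depth tile should fit in L2");
```

So a 128x128 tile of 4x16f color is exactly 128 KB, which is why that tile size matches the ROP cache, and the matching depth region takes only a small slice of the L2.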
If techniques like this become popular in the future, and developers start to spend lots of time optimizing for the modern GPU L2/ROP caches, it might make larger GPU memory pools (such as the Haswell 128 MB L4 cache) less important. It's going to be interesting to see how things pan out.