Just a little note about why CPUs need larger caches than GPUs for good performance.
The slowest-improving thing in computers is not bandwidth; it's latency. Thing is, serial processing is very latency sensitive, since no amount of out-of-order (OOO) magic can get around true data dependences, and real code is full of them. And even if it could...
An L3 cache miss costs around 200 cycles. With 6-wide issue, that amounts to over 1000 instruction slots. If your ROB is around 200 entries (in practice they're a bit smaller), you can cover at most about 20% of the stall, and that's assuming by some miracle none of those instructions depend on the data pending from the load. Remember, the purpose of the ROB is to commit instructions *in order*, so those 200 entries cover only the next 199 instructions following the load, and no more. Thus, you stall for at least 800 instructions.
Even an L2 miss, at around 40 cycles, forces a stall in the best case, since 40 * 6 = 240 potential issue slots exceeds the ~200-entry ROB.
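To make the arithmetic above explicit, here's a back-of-the-envelope model. The issue width, ROB size, and miss latencies are the round numbers assumed in this post, not measurements of any particular core:

```python
ISSUE_WIDTH = 6    # instructions per cycle, as assumed above
ROB_ENTRIES = 200  # reorder buffer size, as assumed above

def slots_lost(miss_cycles, issue_width=ISSUE_WIDTH, rob=ROB_ENTRIES):
    """Best-case issue slots lost to one miss: the ROB can hide at most
    `rob` independent instructions; everything past that is a stall."""
    potential = miss_cycles * issue_width  # slots the core could have issued
    hidden = min(potential, rob)           # best-case coverage from the ROB
    return potential - hidden

print(slots_lost(200))  # L3 miss: 1200 - 200 = 1000 slots stalled
print(slots_lost(40))   # L2 miss:  240 - 200 =   40 slots stalled
```

The "at least 800" figure in the text is this 1000-slot best case rounded down conservatively.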
*Minor correction here: load instructions aren't likely to block in the reorder buffer themselves, since memory accesses are handled separately from the rest of the pipeline. However, the first instruction to use the result of the load will enter the ROB and, once the buffer fills, stall everything else until the load finally commits. This can be sidestepped with prefetching, but effectively scheduling a prefetch 800 instructions in advance can be difficult.*
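To see why scheduling prefetches that far ahead is hard, consider the required prefetch distance in a loop, which is just the ratio of miss latency to per-iteration work. The cycle counts here are illustrative assumptions:

```python
import math

MISS_LATENCY = 200  # cycles for an L3 miss, as assumed above

def prefetch_distance(cycles_per_iter, miss_latency=MISS_LATENCY):
    """How many loop iterations ahead a prefetch must be issued
    for the data to arrive before it is used."""
    return math.ceil(miss_latency / cycles_per_iter)

print(prefetch_distance(2))   # tight 2-cycle inner loop: 100 iterations ahead
print(prefetch_distance(50))  # heavy 50-cycle body: only 4 iterations ahead
```

In a tight loop you'd have to compute the address 100 iterations early, which is often impossible when that address is itself data dependent.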
Now, for 3D rendering...
It is correct that mip-mapping reduces the active memory footprint from frame to frame, but let's look at some numbers: 1920*1080 pixels * (4 bytes diffuse color + 4 bytes normal map + 2 bytes specular intensity + 2 bytes specular power) * 2 samples average anisotropic filtering = ~50 MB of texture lookups. Then there's multi-texturing, as well as other textures I missed (heightmap for relief mapping, light map, etc.). Finally, there are the screen buffers, which for a modern deferred renderer can take hundreds of MB (start with 16-byte pixels for HDR, add in position data and color for perhaps 4 lights per pixel, don't forget the per-pixel normals, specular color, power and intensity, and several other things besides; it gets big fast). There's absolutely no way all that will fit in cache, not even close. This means multiple L3 cache misses per pixel, each of which, as I mentioned before, costs at least 800 instructions. There's a very good reason software renderers are too slow for anything more advanced than PS2-level graphics.
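For the curious, the ~50 MB texture figure works out like this (the per-pixel byte counts are the ones assumed above; real engines vary):

```python
pixels = 1920 * 1080               # one full-HD frame
bytes_per_sample = 4 + 4 + 2 + 2   # diffuse + normal + spec intensity + spec power
aniso_samples = 2                  # assumed average anisotropic sample count

texture_bytes = pixels * bytes_per_sample * aniso_samples
print(texture_bytes)               # 49766400 bytes
print(round(texture_bytes / 1e6))  # ~50 MB
```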
Now you can get around some of this with prefetching and other tricks (which are unnecessary with simple SMT), but you are still ultimately incurring hundreds of MB of compulsory RAM accesses every frame. Beyond the benefit the L1 cache offers for local reuse, e.g. in filtering, the cache hierarchy goes unused.
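Those compulsory accesses translate directly into miss counts and bandwidth. A rough sketch, where 300 MB/frame, 60 fps, and 64-byte cache lines are assumptions for illustration:

```python
CACHE_LINE = 64  # bytes per cache line, a typical value

def per_frame_cost(mb_per_frame, fps):
    """Compulsory misses per frame and the resulting RAM bandwidth."""
    bytes_per_frame = mb_per_frame * 1_000_000
    misses = bytes_per_frame // CACHE_LINE  # one compulsory miss per line
    gb_per_sec = bytes_per_frame * fps / 1e9
    return misses, gb_per_sec

print(per_frame_cost(300, 60))  # ~4.7 million misses/frame, 18 GB/s sustained
```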
Incidentally, there are common shader techniques (relief mapping and its kin, shadow maps, even simple environment maps) which are hard to handle with prefetching, since they involve data-dependent lookups for each pixel.
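A toy model of why those dependent lookups hurt: the second fetch's address is computed from the result of the first, so the misses serialize instead of overlapping. The 200-cycle figure is the same assumption as above:

```python
MISS = 200  # cycles per L3 miss, as assumed above

def dependent_chain(depth, miss=MISS):
    """Each lookup's address needs the previous result: misses serialize."""
    return depth * miss

def independent(depth, miss=MISS):
    """All addresses known up front: misses can all be in flight at once."""
    return miss

print(dependent_chain(2))  # e.g. normal fetch, then env-map fetch: 400 cycles
print(independent(2))      # both issued together: 200 cycles
```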
Another point this brings up is that in order to get anything resembling decent performance from a CPU, even for something as "simple" as 3D rendering, you have to do some really hairy optimizations, such as figuring out how to prefetch data-dependent accesses for (arbitrary!) shaders so that instructions depending on the results of all those L3 cache misses don't eat 800 instructions' worth of latency each. A GPU can get by with much simpler code: you don't have to do contortions to make sure everything is prefetched by the time you use it, since the SMT handles latency hiding automatically. Also, because GPUs rely on many narrow processors rather than a few wide ones, a 200-cycle stall only eats up, say, 400 instruction slots rather than 1200.
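That last claim can be sketched the same way: a narrower core forfeits fewer issue slots per stall, and with enough SMT threads the latency disappears entirely. The issue widths and the per-thread work figure here are illustrative assumptions:

```python
import math

def slots_at_risk(miss_cycles, issue_width):
    """Issue slots forfeited while one thread waits on a miss."""
    return miss_cycles * issue_width

print(slots_at_risk(200, 6))  # wide CPU core: 1200 slots exposed per miss
print(slots_at_risk(200, 2))  # narrow GPU lane: only 400 slots

def threads_to_hide(miss_cycles, cycles_between_misses):
    """SMT threads needed so one is always runnable while the rest wait."""
    return math.ceil(miss_cycles / cycles_between_misses) + 1

print(threads_to_hide(200, 20))  # ~11 threads keep the execution unit busy
```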