Software/CPU-based 3D Rendering

Just a little note about why CPUs need larger caches than GPUs for good performance.

The slowest improving thing in computers is not bandwidth. It's latency. Thing is, serial processing is very latency sensitive, since no amount of OOO magic can get around true data dependence, which is in abundance. And even if it could...

An L3 cache miss costs around 200 cycles. With 6-wide issue, that amounts to over 1000 instructions. If your ROB is around 200 entries (actually, they're a bit smaller...), you can only cover at most 20% of the stall, assuming that by some miracle none of those instructions depend on the data pending from the load. Remember, the purpose of the ROB is to commit instructions *in order*, so those 200 entries cover the next 199 instructions following the load, and no more. Thus, you stall for at least 800 instructions.

Even an L2 miss, at around 40 cycles, forces a stall even in the best case, since 40*6 = 240 > 200.

*Minor correction here: load instructions aren't likely to land in the reorder buffer, since they are handled separately from the rest of the pipeline. However, the first instruction that uses the result of the load will enter the ROB and, once the buffer fills up, stall everything else until it finally commits. This can be sidestepped with prefetching, but scheduling that prefetch 800 instructions in advance can be difficult.*
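To make the prefetching point concrete, here's a minimal sketch of issuing software prefetches far enough ahead of use; the loop, the texel/index arrays, and the prefetch distance are all illustrative assumptions, not taken from any particular renderer:

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cstddef>
#include <cstdint>

// Hypothetical gather loop: the addresses are known ahead of time, so we can
// request cache lines long before they're used. The distance has to be tuned
// so the prefetch covers the ~200-cycle L3 miss latency discussed above.
uint32_t sum_texels(const uint32_t* texels, const uint32_t* indices, std::size_t count)
{
    constexpr std::size_t kPrefetchDistance = 64;  // illustrative tuning knob
    uint32_t sum = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (i + kPrefetchDistance < count)
            _mm_prefetch(reinterpret_cast<const char*>(
                             &texels[indices[i + kPrefetchDistance]]),
                         _MM_HINT_T0);
        sum += texels[indices[i]];  // ideally the line is already in flight by now
    }
    return sum;
}
```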


Now, for 3D rendering...

It is correct that mip-mapping reduces the active memory footprint touched from frame to frame, but let's look at some numbers. 1920*1080 pixels * (4 bytes diffuse color + 4 bytes normal map + 2 bytes specular intensity + 2 bytes specular power) * 2 samples average anisotropic filtering = 50 MB of texture lookups. Then there's multi-texturing, as well as other textures I missed (heightmap for relief mapping, light map, etc.). Finally, there are the screen buffers, which for a modern deferred renderer can take hundreds of MB (start with 16-byte pixels for HDR, add in position data and color for perhaps 4 lights per pixel, don't forget the pixel normals, specular color, power and intensity, and several other things besides. It gets big fast.). There's absolutely no way all that will fit in cache, not even close. This means multiple L3 cache misses per pixel, which as I mentioned before each cost at least 800 instructions. There's a very good reason software renderers are too slow for anything more advanced than PS2-level graphics.
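For what it's worth, the arithmetic behind that 50 MB figure as a back-of-envelope sketch (the byte counts are the ones listed above; nothing else is assumed):

```cpp
#include <cstdio>

int main()
{
    // Per-frame texture traffic using the numbers quoted above.
    const long long pixels        = 1920LL * 1080;   // frame resolution
    const long long bytes_per_tap = 4 + 4 + 2 + 2;   // diffuse + normal + spec intensity + spec power
    const long long samples       = 2;               // average anisotropic samples
    const long long bytes_total   = pixels * bytes_per_tap * samples;
    std::printf("texture lookups per frame: %.1f MB\n", bytes_total / 1e6);  // ~49.8 MB
    return 0;
}
```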

Now you can get around some of this by prefetching, and other tricks (which are unnecessary with simple SMT), but you are still ultimately running hundreds of MB of compulsory RAM access every frame. Beyond the benefit the L1 cache offers for local reuse, e.g. in filtering, the cache hierarchy is unused.

Incidentally, there are common shader techniques (relief mapping and kin, shadow maps, even simple environment maps) which are hard to do with prefetching, since they involve data dependent lookups for each pixel.
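A tiny sketch of why such lookups resist prefetching, using an environment map as the example (the lat-long mapping and all names here are just illustrative):

```cpp
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// The texel address depends on a reflection vector computed per pixel, so there
// is nothing to prefetch until the normal has been fetched and the math has run.
uint32_t sample_env_map(const uint32_t* env_map, int env_size,
                        const Vec3& view, const Vec3& normal)  // both unit length
{
    float d = view.x * normal.x + view.y * normal.y + view.z * normal.z;
    Vec3 r  = { view.x - 2.0f * d * normal.x,   // r = v - 2(v.n)n
                view.y - 2.0f * d * normal.y,
                view.z - 2.0f * d * normal.z };

    // Crude lat-long mapping; only the data-dependent address matters here.
    float u = 0.5f + std::atan2(r.z, r.x) * 0.15915494f;  // 1/(2*pi)
    float v = 0.5f - std::asin(r.y)       * 0.31830988f;  // 1/pi
    int tx = static_cast<int>(u * (env_size - 1));
    int ty = static_cast<int>(v * (env_size - 1));
    return env_map[ty * env_size + tx];  // address only known at this point
}
```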

Another point this brings up is that in order to get anything resembling decent performance from a CPU, even for something as "simple" as 3D rendering, you have to do some really hairy optimizations. Such as figuring out how to prefetch data dependent stuff for (arbitrary!) shaders to avoid eating those 800 instructions worth of latency from any instruction depending on the result of all those L3 cache misses. A GPU can get by with much simpler code, since you don't have to do contortions to make sure everything is prefetched by the time you use it, since the SMT handles it automatically. Also, the fact that they rely on many narrow processors rather than few wide processors means that a 200 cycle stall only eats up say 400 instructions, rather than 1200.
 
Dynamic frequency and voltage adjustment is built into most of the desktop, server, and mobile architectures I can think of.
But it's not adjusted based on the workload, which is what I was asking about. To balance a CPU's responsiveness and throughput for a given TDP, it has to adjust things based on the workload, per core.
FMA does neither. It explicitly recognizes a serial dependence and optimizes the hardware for that case. Being able to extract or infer semantic information about the workload because of hints in software or primitives added to the architecture can extract performance without having to scan more instructions, lanes, or threads.
Absolutely. Which is why I've also suggested adding new instructions before. In particular, I think extending the scalar BMI instructions to vector instructions could be very valuable, and not just for graphics.
It may end Moore's Law. Technically, if you think process nodes are all that it takes to maintain it, Moore's Law would have ended already. The 200, 300, and eventually 450mm wafer sizes are there to maintain the cost per device. The same goes for interposers and die stacking, and some of the attempts at doing something other than silicon or something different than photolithography.

A lot of these methods can increase the number of transistors per device without changing the geometry of the transistors. Granted, it won't necessarily relax the power constraints, which is where my argument kicks into high gear.

As I said, Moore's Law is about how many transistors you can affordably fit onto a device. I don't see a strong limit on how transistors can be added to any kind of design, but a stronger limit on the power they need to dissipate. Transistors are cheap and only getting cheaper.
But you asked specifically: "how many more silicon nodes do you think we have left to hide this future in".

So let's split this argument up: first, there have to be enough silicon node shrinks or other approaches to keep increasing the transistor count; secondly, the transistors have to continue to become cheaper to produce; and thirdly, the transistors/design/architecture have to become more power efficient. Correct me if I'm wrong, but it now looks like you agree that the first and second aspects aren't likely to run into insurmountable issues just yet.

The power consumption might indeed be the trickiest part. But I don't see any reason for despair. First of all, CPU cores can double or quadruple the SIMD throughput (again) without costing an equal increase in power consumption, because it represents only a fraction of the power budget. There's no way Haswell will consume more than Westmere, and that trend will most likely continue. And then there's the opportunity for long-running wide vector instructions which allow a further reduction in power consumption. Next, there's the piecemeal introduction of NTV technology and adjusting the clock frequency based on the workload. And lastly, tons of research is going into lowering the transistors' power consumption now. Multigate transistors were an important breakthrough, and junctionless transistors could be the next major leap which make the ITRS projections highly conservative.
I guess we'll see about that.
We don't have to wait and see. The facts are already known. At the "optimal" operating point, the clock frequency is ~9x lower, while the power consumption is ~45x lower. To compensate for this loss in absolute performance, you'd need an order of magnitude more transistors. And that's just to keep the same performance level. It offers a nice 5x reduction in power consumption, but at an insane increase in die size. Note that I didn't even factor in the transistor/area increase due to NTV technology itself yet, nor any performance loss due to Amdahl's Law.
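Spelling that trade-off out with the quoted figures (a back-of-envelope sketch; the ~9x and ~45x ratios are the ones stated above):

```cpp
#include <cstdio>

int main()
{
    const double freq_drop  = 9.0;    // clock ~9x lower at the NTV "optimal" point
    const double power_drop = 45.0;   // power ~45x lower
    const double perf_per_watt_gain = power_drop / freq_drop;    // ~5x
    const double extra_cores        = freq_drop;                 // ~9x more transistors to restore throughput
    const double power_at_same_perf = extra_cores / power_drop;  // ~0.2x of the original power
    std::printf("perf/W gain: %.0fx, cores needed for equal performance: %.0fx, "
                "relative power at equal performance: %.2fx\n",
                perf_per_watt_gain, extra_cores, power_at_same_perf);
    return 0;
}
```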

Transistors may be getting cheaper, but only at the rate of Moore's Law, at best. Only niche markets where low power consumption is way more important than absolute performance can afford to have chips that nominally run at NTV voltage. The only commercially viable use for consumer products is for low idle power consumption.

So again, you don't want full-blown NTV technology any time soon. Just a minor reduction in voltage with every new process node gives the best balance between frequency, area, and power consumption. This can be achieved with less invasive design changes.
Slow in clock speed. It'll render circles around the CPUs without compromising the user's experience or battery life. The CPUs can burn all the way to their turbo limit to get whatever single-threaded performance they can. That's what I buy the CPU cores for, anyway.
Slow in absolute speed. A GPU running at NTV voltage will decimate the framerate. That does compromise the user experience.

Of course when you're talking about near idle operation, such as merely rendering the desktop interface, then this technology becomes valuable. But this is equally true for the CPU.

Also note that today's GPUs are massively overkill for desktop rendering. There's no need for floating-point programmable shaders even. But this is acceptable because the power consumption is adequately low. It won't take much to make CPUs adequately power efficient for this task either.
I guess we'll see if the desktop is the trendsetter, and we will see if they become unified.
Wide out-of-order execution CPUs and DirectX 11 GPUs are coming to mobile devices. So the desktop is still the trendsetter. Regardless, the majority of people aren't gamers. They rarely use the GPU to its fullest. Again just look at the distribution of HD 2500s and HD 4000s. Business desktops benefit more from a quad-core than from a more powerful integrated GPU aimed at gaming.

So with wider SIMD units in the CPU cores to adequately handle graphics, which also benefits other applications, CPU manufacturers can save on die area and thus sell with higher margins. And these chips would also benefit gamers, because they have a discrete GPU anyway and can use the CPU's higher throughput for other workloads.

There's also a convergence in the software that is run on desktops and mobile devices. So when unified CPUs become mainstream on the desktop, it's only a matter of time before applications which depend on high CPU throughput require a similar architecture on mobile devices. It wouldn't surprise me at all if an Atom architecture with AVX2 support is already in the making.
Artificially constrain the amount of instructions that can go into an OoO engine, and you're doing something wrong. If that engine really isn't needed, I can think of a design that can save a lot of trouble.
There's nothing wrong about adjusting to the workload. And there's nothing artificial or constraining about it either. When the buffers are full of long-running instructions, the previous stage(s) can be clock gated for a certain number of cycles since there's plenty of work anyway. They might even do this already (they certainly do something similar for the uop cache and decoders, although that's in the in-order part).

And yes, you could have different architectures for each and every workload, but then there's duplication of logic, extra data movement, and more programming troubles.
What about those exquisite aggressively out of order and speculative memory pipelines and buffers?
Their burden is only 2-4x as much per instruction?
No need for extra prefetch instructions, or more aggressive hardware prefetches overall?
Indeed, with long-running SIMD instructions there is potential for making things like prefetching less aggressive. I've mentioned that before in the Larrabee thread.
Not likely to need more traffic with the scalar pipelines to manage masking or predication?
No. Scalar instructions are interleaved with the long-running SIMD instructions. So they execute at a slower pace as well.
There are design choices that even Intel has put forward that can significantly reduce the burden of developing for specialized cores. As long as the ISA is consistent across them, Intel's SOCs already do a lot behind the scenes that the software is not aware of.
Please elaborate on these design choices. And what do their SoCs do behind the scenes that the software isn't aware of (that other designs don't do)?
Generally software neglect on the part of Nvidia. Not too many loads have Fermi being legitimately faster. It's not a loss I care too much about, since Kepler performs its primary function admirably.
But we weren't talking about Kepler's primary function.
Did you miss the section on the power consumption of a latency-optimized pipeline?
No, but it's obviously highly skewed. First of all, the 21264 wasn't power optimized at all. They didn't care if it consumed more Watts, as long as that made it faster. Secondly, an architecture like Haswell can do 8x more floating-point operations per cycle, just by having wide SIMD units and FMA. And thirdly, you have to take the data locality into account to compare their power efficiency. So it's quite disingenuous to compare that old CPU architecture against a portion of an experimental GPU architecture.
Shared ISA or common abstractions are the next step. The removal of incidental obstacles, per Dally's terminology: like isolated memory spaces and the barrier of an expansion bus.
Bringing it one step closer to unification.
Dynamically-optimizing runtimes are another area of research, where a lightweight scheduler/compiler can hook into performance and event data generated at runtime and shift the instruction mix or thread affinity as it sees fit. This is important going forward not just for performance consistency, but also because power-aware software scheduling has shown some interesting benefits even with static compilation.
Sounds like software rendering to me.
 
Nick said:
But we weren't talking about Kepler's primary function.
By that logic, CPUs will never be suitable for graphics because they are not suitable today. Or would you say graphics is not Ivy Bridge's primary function?
 
The slowest improving thing in computers is not bandwidth. It's latency. Thing is, serial processing is very latency sensitive, since no amount of OOO magic can get around true data dependence, which is in abundance.
Which means GPUs are worse off. They can't schedule around data dependencies within threads, only across threads.
An L3 cache miss costs around 200 cycles. With 6-wide issue, that amounts to over 1000 instructions. If your ROB is around 200 entries (actually, they're a bit smaller...), you can only cover at most 20% of the stall, assuming that by some miracle none of those instructions depend on the data pending from the load. Remember, the purpose of the ROB is to commit instructions *in order*, so those 200 entries cover the next 199 instructions following the load, and no more. Thus, you stall for at least 800 instructions.
You need a perfectly balanced instruction mix to issue 6 uops per cycle. That's not realistic. Also, with x86 every arithmetic instruction can perform a load or store operation. The ROB stores such fused uops. So an L3 miss costs far fewer instructions. Furthermore, Hyper-Threading hides many of the misses.

That said, even if an L3 miss cost 800 instructions, that's barely significant. Today's CPUs execute billions of instructions per second. You can have millions of L3 misses per second before you start noticing it. GPUs aren't stall free either.
Now, for 3D rendering...

It is correct that mip-mapping reduces the active memory footprint touched from frame to frame, but let's look at some numbers. 1920*1080 pixels * (4 bytes diffuse color + 4 bytes normal map + 2 bytes specular intensity + 2 bytes specular power) * 2 samples average anisotropic filtering = 50 MB of texture lookups. Then there's multi-texturing, as well as other textures I missed (heightmap for relief mapping, light map, etc.). Finally, there are the screen buffers, which for a modern deferred renderer can take hundreds of MB (start with 16-byte pixels for HDR, add in position data and color for perhaps 4 lights per pixel, don't forget the pixel normals, specular color, power and intensity, and several other things besides. It gets big fast.). There's absolutely no way all that will fit in cache, not even close.
It really doesn't have to. Memory accesses that follow a regular pattern can be prefetched.
This means multiple L3 cache misses per pixel, which as I mentioned before each cost at least 800 instructions.
Wrong, and wrong.
There's a very good reason software renderers are too slow for anything more advanced than PS2-level graphics.
They can do way more than that.
Now you can get around some of this by prefetching, and other tricks (which are unnecessary with simple SMT)...
Prefetching on the GPU is an active research topic. GPUs stall when they run out of threads to switch between (which happens with threads that use lots of registers and frequently access memory). Judicious prefetching is very valuable.
Incidentally, there are common shader techniques (relief mapping and kin, shadow maps, even simple environment maps) which are hard to do with prefetching, since they involve data dependent lookups for each pixel.
There's still a degree of coherency in those accesses.
Another point this brings up is that in order to get anything resembling decent performance from a CPU, even for something as "simple" as 3D rendering, you have to do some really hairy optimizations. Such as figuring out how to prefetch data dependent stuff for (arbitrary!) shaders to avoid eating those 800 instructions worth of latency from any instruction depending on the result of all those L3 cache misses.
The CPU is really good at automatic prefetching.
A GPU can get by with much simpler code, since you don't have to do contortions to make sure everything is prefetched by the time you use it, since the SMT handles it automatically.
Except when it doesn't. And prefetching for the GPU is actually much harder due to frequent switching between threads that can be executing different code that need very different data.
Also, the fact that they rely on many narrow processors rather than few wide processors means that a 200 cycle stall only eats up say 400 instructions, rather than 1200.
GPUs don't have narrow processors. Kepler cores have six 1024-bit SIMD units, and nine other functional units. It has four schedulers which each dispatch two instructions. But if one stalls, the others are likely to stall for the same reason. Things can get quite horrible.

Look, it's obvious that GPUs are good at graphics. But they're not as perfect as you portray them, and CPUs are not half as bad at graphics as you think they are. They just lack wide SIMD units and things like gather support. Haswell is a major improvement. And long-running vector instructions would help to further cover latencies and improve power efficiency. There is no intrinsic reason why a CPU architecture can't become fully adequate for a lot of people's graphics needs.
 
Which means GPUs are worse off. They can't schedule around data dependencies within threads, only across threads.
No, it means you are way better off deferring latency sensitive tasks to specialized processors which excel in this field (let's call them latency optimized cores), while latency tolerant tasks get executed by throughput optimized cores.
You postulate that in the future one may get a jack of all trades (master of none) device. The vast majority thinks (and has been telling you so for at least a year, I think) that a combination of different cores (each one a master in a certain field) will be the higher performing alternative, especially in the age of power constrained dark silicon and huge transistor numbers to spare.
 
By that logic, CPUs will never be suitable for graphics because they are not suitable today. Or would you say graphics is not Ivy Bridge's primary function?
We were discussing GPGPU, not graphics. When I pointed out that Fermi is better at it than Kepler, 3dilettante changed the subject by saying that Kepler performs its primary function (graphics) admirably. I do not disagree with that, but that wasn't the subject.

I don't quite follow how that makes you say that CPUs will never be suitable for graphics.
 
No, it means you are way better off deferring latency sensitive tasks to specialized processors which excel in this field (let's call them latency optimized cores), while latency tolerant tasks get executed by throughput optimized cores.
The problem is that with data dependencies being "abundant" and latency being the "slowest improving", everything becomes latency sensitive when things scale up. Just following keldor314's logic.
You postulate that in the future one may get a jack of all trades (master of none) device.
Both the CPU and GPU are a Jack of many trades, master of none. If done right, unifying them into a Jack of all trades can be better as a whole.

Just look at what the unification of vertex and pixel processing has done. They were very different workloads at first, but they converged, and eventually caused the hardware to unify. And this opened up a whole new range of possibilities.
The vast majority thinks (and has been telling you so for at least a year, I think) that a combination of different cores (each one a master in a certain field) will be the higher performing alternative, especially in the age of power constrained dark silicon and huge transistor numbers to spare.
With all due respect, a lot of people here also thought unifying the vertex and pixel pipelines was a bad idea. They've obviously been proven wrong. A GPU with a combination of different cores can indeed achieve higher performance under certain circumstances, but performance of legacy applications isn't the be-all and end-all. We need more flexible hardware to run new applications.

Things like power consumption are hurdles, they're not walls. Haswell will deliver much higher throughput per Watt than its predecessors, and I've already outlined multiple techniques here that would allow the gap to keep shrinking. Continuous convergence like that can only lead to unification.

Note that AMD is striving to unify the address space of the CPU and GPU. This doesn't come for free, but it offers new opportunities. So unification is something that is desirable, and well worth taking technical hurdles for.
 
Fixed function is always faster and more efficient than programmable though, so whenever the optimal solution to a given problem is discovered, shouldn't it be implemented in hardware? By the time CPUs are good enough, how much work will there be left for them to do?
 
Fixed function is always faster and more efficient than programmable though, so whenever the optimal solution to a given problem is discovered, shouldn't it be implemented in hardware? By the time CPUs are good enough, how much work will there be left for them to do?
Most computational problems have lots of parameters which are only known at run-time, which completely change the optimal solution. For games, a simple mouse movement can change the graphics workload characteristics dramatically. There are practically infinite variations on what data and calculations are required when and where. So it's utterly impossible to have optimal hardware for each situation. You need highly flexible, programmable hardware that is capable of dealing with unpredictable events. Furthermore, die space is finite, so specialized units leave less room for other units. Last but not least, moving the data from one specialized unit to another takes bandwidth, latency, and control logic.

In fact the arithmetic units are becoming just a fraction of the transistor and power budget. It's getting the data down to them that's the hard part. Basically what this means is that you can make things programmable at a very low cost. That's exactly what has been happening with GPUs in the last decade, and continues to happen. Soon even mobile graphics designs will support DirectX 11 level graphics. So the trends are pointing away from fixed-function logic, even for the most power constrained devices. Eventually things like texture filtering will also become fully programmable, simply because the arithmetic cost of it will become far lower than the data accesses, and because there are varying filtering needs.

So in short, fixed-function hardware isn't always faster or more power efficient than programmable hardware, if you look at the whole picture and not just the logic in isolation. Just imagine rendering Crysis 2 with fixed-function graphics chips. It could be done, but it would be horribly slow and power inefficient due to the many rendering passes. The programmability wouldn't actually be gone either, it would just be handled at a higher level. So bringing more programmability and flexibility to the lower level logic is a valuable thing.

It's all a balancing act. Ultimately we'll need unified cores that can extract ILP, DLP and TLP in varying amounts, and are equipped with a versatile set of powerful instructions. In this way instructions essentially become fixed-function building blocks. They have to be chosen wisely to best support the entire range of workloads.
 
In fact the arithmetic units are becoming just a fraction of the transistor and power budget. It's getting the data down to them that's the hard part. Basically what this means is that you can make things programmable at a very low cost. That's exactly what has been happening with GPUs in the last decade, and continues to happen.

Show some evidence that GPUs, especially consumer grade GPUs, have been shifting more and more die area to moving data around instead of ALUs.. especially to the point where they're now not even the major part of the die area. Because every time a shrink happens I see the execution unit count skyrocket while the instruction scheduling doesn't get any more sophisticated and the latency hiding remains the same simple round robin multi-threading. With Kepler nVidia even took a step in the opposite direction you're claiming.

Yes, this hurts some GPGPU workloads.. but according to you GPGPU's days are numbered so that means the GPU vendors can go back to focusing solely on what's good for mainstream graphics. Of course you probably think the GPU's days are numbered too..

Another area which has been pushing opposite the trends you claim is consoles, where over and over again latency sensitive and irregular code patterns are sacrificed for data regular throughput. It remains to be seen what all the console companies will do this generation, but so far Nintendo has chosen to pair a traditional GPU that's several times larger than its CPU, so that's one answer.
 
I think Nick is right on this, even Dally said so in his presentation about Exaflop scale. Power isn't consumed doing arithmetic, but moving data around on the chip. That's exactly the reason why they try so hard to minimize it.

I'm just not sure if I draw the same conclusions. Would need to think more about it.
 
I think Nick is right on this, even Dally said so in his presentation about Exaflop scale. Power isn't consumed doing arithmetic, but moving data around on the chip. That's exactly the reason why they try so hard to minimize it.

He said that about current high end CPUs, not GPUs. His point was exactly that if you don't need to worry about that you can (and do) get a lot more arithmetic work done. nVidia is definitely not on board with Nick's vision.
 
Dally said that most of the energy is spent on moving data, for both CPUs and GPUs…

…And that most of the transistors are spent on scheduling in CPUs, but not in GPUs, especially since NVIDIA has been simplifying scheduling work as much as possible lately. Ultimately, he wants to strip it down to the bare minimum. But that doesn't make fetching data from across the chip less costly in energy, which is still necessary on a GPU.

Simplifying scheduling is great, but it only removes the scheduling overhead, which is largely unnecessary in a massively threaded processor. It's not really what Dally is referring to when he's talking about "moving data". He's just talking about the basic action of fetching.
 
An L3 cache miss costs around 200 cycles. With 6-wide issue, that amounts to over 1000 instructions. If your ROB is around 200 entries (actually, they're a bit smaller...), you can only cover at most 20% of the stall, assuming that by some miracle none of those instructions depend on the data pending from the load. Remember, the purpose of the ROB is to commit instructions *in order*, so those 200 entries cover the next 199 instructions following the load, and no more. Thus, you stall for at least 800 instructions.

Even an L2 miss, at around 40 cycles, forces a stall even in the best case, since 40*6 = 240 > 200.
Sandy/Ivy/Haswell can only fetch/decode four instructions per cycle. So that limits maximum sustainable throughput to 4 instructions per cycle on these CPUs. In real software, the IPC of modern Intel x86 core is usually around 1.0-1.5 (http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/9). Vectorized pixel batch processing is (usually) pretty easy to pipeline and prefetch, so I'd say 2.5 sustained IPC is possible for highly optimized intrinsic based code.

Sandy Bridge (3.3 GHz) L3 miss costs around 150 cycles (http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/6). In total this is 2.5 IPC * 150 cycles = 375 instructions. An L2 miss costs 25 cycles. That's 2.5 IPC * 25 cycles = 63 instructions. A 168-entry ROB can handle the L2 misses easily, but it can only fill around half of the L3 miss. But wouldn't the CPU start to fetch/decode new instructions into the ROB, as it can issue + retire the other 168 instructions in the ROB (and free space that way)? Hyperthreading makes sure there are many instructions in the ROB without dependencies on the current data miss.

I am not trying to say that L2/L3 cache misses are preferable by any means (all coders should always try to minimize them), but with hyperthreading and sophisticated out-of-order execution, the CPU can usually fill most of the stall cycles (as long as cache misses are not happening too often).
1920*1080 pixels * (4 bytes diffuse color + 4 bytes normal map + 2 bytes specular intensity + 2 bytes specular power) * 2 samples average anisotropic filtering = 50 MB of texture lookups. Then there's multi-texturing, as well as other textures I missed (heightmap for relief mapping, light map, etc.). Finally, there are the screen buffers, which for a modern deferred renderer can take hundreds of MB (start with 16-byte pixels for HDR, add in position data and color for perhaps 4 lights per pixel, don't forget the pixel normals, specular color, power and intensity, and several other things besides. It gets big fast.). There's absolutely no way all that will fit in cache, not even close. This means multiple L3 cache misses per pixel, which as I mentioned before each cost at least 800 instructions.
All games use texture compression. A common material can for example contain two BC3 (DXT5) textures + one BC5 (BC5 = two DXT5 independent alpha channels). That kind of setup can store 10 channels. For example: RGB color, normal vector (2d tangent space), roughness, specular, ambient occlusion, opacity and height (for simple parallax mapping). That is 3 bytes per pixel. Multiply by 1920x1080 and you get 6.2 MB. That 50 MB figure for material sampling could be true for CGI offline renderers, but not for current generation interactive games.
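The block-compression arithmetic behind that, as a quick sketch (BC3 and BC5 both pack a 4x4 texel block into 16 bytes; the rest follows from the figures above):

```cpp
#include <cstdio>

int main()
{
    const double bc3_bytes_per_texel = 16.0 / (4 * 4);   // 1 byte/texel
    const double bc5_bytes_per_texel = 16.0 / (4 * 4);   // 1 byte/texel
    const double material_bpt = 2 * bc3_bytes_per_texel + bc5_bytes_per_texel;  // 3 bytes, 10 channels
    const double screen_texels = 1920.0 * 1080.0;
    std::printf("material data per frame: %.1f MB\n",
                material_bpt * screen_texels / 1e6);      // ~6.2 MB
    return 0;
}
```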

We have been using deferred rendering since the DX9 era (2007). I fully agree that the g-buffers can get really fat. The most optimal layout (depth + two 8888 g-buffers) barely fits into the Xbox 360's 10 MB EDRAM, and that's at slightly sub-HD resolution (1152x720).

You don't need to store position data in the g-buffer at all, since you can reconstruct it using the interpolated camera view vector and the pixel depth (a single multiply-add instruction). Normals are also often stored in 2d, and the third component is reconstructed (for example by using a Lambert azimuthal equal-area projection). Albedo color in the g-buffer does not need to be in HDR, because the g-buffer only contains the data sampled from the DXT-compressed material in [0,1] range (not the final lit result).
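A minimal sketch of that position reconstruction; the struct and names are illustrative, but the idea is exactly one multiply-add per component against the interpolated view ray:

```cpp
struct Vec3 { float x, y, z; };

// position = camera_pos + view_ray * linear_depth
// view_ray is the interpolated per-pixel camera ray; linear_depth comes from the
// depth buffer. Each component maps to a single (vectorizable) multiply-add.
inline Vec3 reconstruct_position(const Vec3& camera_pos, const Vec3& view_ray,
                                 float linear_depth)
{
    return { camera_pos.x + view_ray.x * linear_depth,
             camera_pos.y + view_ray.y * linear_depth,
             camera_pos.z + view_ray.z * linear_depth };
}
```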

A typical high end game could for example have a D24S8 depth buffer (4 bytes) + four 11-11-10 g-buffers (4 bytes each). That's 20 bytes per pixel. If you also have screen space shadow masks (an 8888 32-bit texture contains four lights) and on average have 12 lights for each pixel, you need to fetch 32 bytes for each pixel in the deferred lighting shader. One x86 cache line can hold two of these input pixels during the lighting pass. The access pattern is fully linear, so you never miss L1. The "standard" lighting output buffer format is 16f-16f-16f-16f (half float HDR, 8 bytes per pixel). All the new x86 CPUs have the F16C (CVT16) instruction set, so they can convert a 32-bit float vector to a 16-bit float vector in a single instruction. Eight output pixels fit in a single x86 cache line, and again the address pattern is fully linear (we never miss L1, because of the prefetchers). Of course you wouldn't even want to pollute the L1 cache with output pixels and would use streaming stores instead (as you can easily generate whole 64 byte lines one at a time).
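As a rough illustration of that output path (assuming AVX + F16C and a 16-byte-aligned destination; the function and buffer names are made up):

```cpp
#include <immintrin.h>
#include <cstdint>

// Convert lit pixels from 32-bit float to half float and write them with
// non-temporal (streaming) stores, so the output never pollutes the caches.
// Each iteration handles 8 floats (two 16f-16f-16f-16f pixels); four such
// stores fill one 64-byte cache line.
void store_hdr_output(std::uint16_t* dst, const float* lit, int groups_of_8)
{
    for (int i = 0; i < groups_of_8; ++i)
    {
        __m256  px = _mm256_loadu_ps(lit + i * 8);                    // 8 floats
        __m128i h  = _mm256_cvtps_ph(px, _MM_FROUND_TO_NEAREST_INT);  // 8 half floats (F16C)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i * 8), h); // streaming store
    }
    _mm_sfence();  // order the non-temporal stores before any later reads
}
```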

The GPU execution cycles are split roughly like this in current generation games (using our latest game as an example):
- 25% shadow map rendering (we have fully dynamic lighting)
- 25% object rendering to g-buffers
- 30% deferred lighting
- 15% post processing
- 5% others (like virtual texture updates, etc)

Deferred lighting and post processing are pure 2d passes and easy to prefetch perfectly (all fetched data in L1 + streaming stores). In total those passes take 45% of the frame time. Shadow map rendering doesn't do any texture sampling, it just writes/reads the depth buffer. A 16-bit depth buffer (half float) is enough for shadow maps. A 2048x2048 shadow map at 2 bytes per pixel is 8 MB. The whole shadow map fits nicely inside the 15 MB Sandy Bridge E cache (Haswell E will likely have even bigger caches). So, no L3 misses at all in shadow map rendering. Tiled shadow map rendering (256 kB tiles = 512x256 pixels) fits completely inside the Sandy/Ivy/Haswell L2 cache of the CPU cores (and thus all cores can nicely work in parallel without fighting over the shared L3). Shadow map vertex shaders are easy to run efficiently on AVX/FMA (just 1:1 mapping, nothing fancy, 100% L1 cache hit). So basically the only thing that doesn't suit CPU rendering that well is the object rendering to g-buffers. And that's only 25% of the whole frame... and even that can be quite elegantly solved (by using a tile based renderer). I'll write a follow up on that when I have some more time :)
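The cache-footprint math behind those shadow map claims, as a quick sketch (the sizes are the ones given above):

```cpp
#include <cstdio>

int main()
{
    const double shadow_map_bytes = 2048.0 * 2048 * 2;   // 16-bit depth: 8 MB, fits a 15 MB L3
    const double tile_bytes       = 512.0 * 256 * 2;     // one tile: 256 kB, fits a per-core L2
    std::printf("full shadow map: %.0f MB, one 512x256 tile: %.0f kB\n",
                shadow_map_bytes / (1024 * 1024), tile_bytes / 1024);
    return 0;
}
```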
There's a very good reason software renderers are too slow for anything more advanced than PS2-level graphics.
Only good for PS2 level graphics? Many best looking PS3 games use similar techniques to do deferred lighting with CPU (Cell SPUs). Larrabee was also doing software rendering on x86 cores, and the game performance was (slightly) faster than current generation consoles (PS3 & Xbox 360). Haswell-E should have comparable single precision flops to Larrabee (two 256 bit FMAs per cycle, 8 cores, 4 GHz = 1024 GFLOP/s), so it should be much better than PS2 in rendering, and likely also beat PS3 (but only slightly, like Larrabee did).
Beyond the benefit the L1 cache offers for local reuse, e.g. in filtering, the cache hierarchy is unused.
Same is true for GPUs. Texture caches are very small, because their main purpose is filtering (and keeping around the unused pixels of the 4x4 DXT blocks). The new GPUs however have L2 caches as well. Kepler has 512 kB of L2, and GCN has 768 kB. Usually when an object is rendered to screen, it has some internal overdraw (at least the backside polygons are sampling the same z-buffer values as the frontside polygons). If you are doing some kind of tile binning (PowerVR TBDR, or CPU based tiled rendering), you can keep the depth buffer and color buffer of the tile in the L2/L3 cache of the CPU. That would reduce the deferred rendering bandwidth requirements a lot (overdraw only needs to access caches). This is similar to how Intel implemented their software renderer on Larrabee (the precursor of Xeon Phi).
Another point this brings up is that in order to get anything resembling decent performance from a CPU, even for something as "simple" as 3D rendering, you have to do some really hairy optimizations. Such as figuring out how to prefetch data dependent stuff for (arbitrary!) shaders to avoid eating those 800 instructions worth of latency from any instruction depending on the result of all those L3 cache misses
This is true. It's very hard to manually prefetch texels, for example for shadow mapping. Many PS3 games use the SPUs to do lighting. The preferred way is to gather shadow map samples into a screen space buffer with the GPU and sidestep the issue completely. All screen space buffers are trivial to prefetch. This way deferred lighting can be processed on the CPU without any cache misses at all (all data is always in L1). Of course a fully CPU-driven renderer needs to solve this issue. Sandy/Ivy/Haswell-E all have enough L3 cache to store a whole shadow map, so this is not that big an issue for them (compared to the 256 kB local memory of a Cell SPU). But staying within L2 would of course be preferable (and that could be doable with tricks such as virtual shadow mapping).
 
In what test is Fermi better than K20 or K20x?
Please don't pull things out of context. It was an architectural discussion, so there's no direct comparison between K20 and any Fermi chip. It's a well known fact that Kepler's scheduling is more static, which hurts GPGPU performance. The GTX 680 loses against the GTX 560 Ti 448 and even a quad-core CPU. So optimizing for latency does matter to throughput-oriented chips.
 
Do you have any non-OpenCL benchmarks to prove that point? Or do you believe there's something about OpenCL that reduces Kepler's performance, something that doesn't apply to CUDA?
 
Nick said:
Please don't pull things out of context.
I didn't.

Nick said:
It was an architectural discussion
And K20 is representative of Kepler's GPGPU compute architecture...

Nick said:
The GTX 680 loses against the GTX 560 Ti 448 and even a quad-core CPU
Wait, I thought it was an architectural discussion... I'm confused.
 
Show some evidence that GPUs, especially consumer grade GPUs, have been shifting more and more die area to moving data around instead of ALUs.. especially to the point where they're now not even the major part of the die area. Because every time a shrink happens I see the execution unit count skyrocket while the instruction scheduling doesn't get any more sophisticated and the latency hiding remains the same simple round robin multi-threading. With Kepler nVidia even took a step in the opposite direction you're claiming.
Cypress - 2720 GFLOPS / 2154 Mtrans
Cayman - 2703 GFLOPS / 2640 Mtrans
Tahiti - 4300 GFLOPS / 4313 Mtrans

At 45 nm, single-precision FMA units have a size of about 0.0065 mm²/GFLOP, so in Cayman they should only occupy about 4% of the die space. Of course there's more to GPUs than FMA units, but it does put their primary performance metric into perspective. It would have only cost 4% more die area to make Cayman capable of 5.4 TFLOPS. The reason this wasn't done is simply because they can't feed that many ALUs. In fact it's already a fairly 'bursty' architecture; it only reaches 2.7 TFLOPS momentarily. It was beaten by the GF110 with barely over half the peak performance.
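The area arithmetic behind that ~4% figure, as I read it (the 0.0065 mm²/GFLOP number is the one quoted; the ~389 mm² Cayman die size is my own assumption):

```cpp
#include <cstdio>

int main()
{
    const double mm2_per_gflop  = 0.0065;   // single-precision FMA, ~45 nm (as quoted)
    const double cayman_gflops  = 2703.0;
    const double cayman_die_mm2 = 389.0;    // assumed die size
    const double fma_area = mm2_per_gflop * cayman_gflops;   // ~17.6 mm^2
    std::printf("FMA area: %.1f mm^2, roughly %.1f%% of the die\n",
                fma_area, 100.0 * fma_area / cayman_die_mm2);
    return 0;
}
```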

Kepler also takes the 'bursty' route. GK104 offers nearly twice the peak performance over GF110, but nowhere near such a leap in actual graphics performance, and compromises have been made to GPGPU efficiency. Granted, it makes it an outlier when looking at the compute density evolution. This is by all means the right move, for now, for a company that doesn't make any money from GPGPU in the consumer market.

Last but not least, keep in mind that there was a stark increase in the ALU:TEX ratio of shaders. So earlier GPU architectures spent ever more die area on compute units. The increase in arithmetic throughput was nothing short of impressive, but they've now reached the limit of die area that can be spent on it while still being able to feed all the ALUs with data, some of the time. So the future of scaling GPU performance is quite bleak. They have to spend more area on on-die storage (registers and cache) and/or increase the data locality by decreasing the number of threads, which can only be achieved through more dynamic scheduling.
Yes, this hurts some GPGPU workloads.. but according to you GPGPU's days are numbered so that means the GPU vendors can go back to focusing solely on what's good for mainstream graphics. Of course you probably think the GPU's days are numbered too..
Yes, although GPUs certainly won't die out overnight. It highly depends on how (real-time rasterization) graphics-oriented a device is. We're already starting to see some cracks. Ever more people are using an integrated GPU. Discrete GPUs are becoming more graphics oriented to stay relevant a while longer. NVIDIA is wisely taking its first steps in CPU design. You can compare all this to the slow demise of dedicated sound processing. First you had to have a discrete card. Then the chips became integrated into the chipset. The sound card manufacturers reacted by aiming at a niche market with more specialized products. Today sound cards still exist, but for the vast majority of systems sound processing is done by the CPU. Of course graphics is far more demanding than sound, but there is a point where CPUs will become sufficiently powerful for high-DLP workloads that graphics won't require dedicated cores.
Another area which has been pushing opposite the trends you claim is consoles, where over and over again latency sensitive and irregular code patterns are sacrificed for data regular throughput. It remains to be seen what all the console companies will do this generation, but so far Nintendo has chosen to pair a traditional GPU that's several times larger than its CPU, so that's one answer.
The trends I claim are universal but they're going to affect desktops first. There are tons of business computers that have very low graphics demands, which could save the cost of an integrated GPU once the CPU cores offer adequate throughput. Consoles are at the complete opposite side of the spectrum. They're about the last devices I expect to feature unified cores. Heck, the PS3 doesn't even have a unified GPU architecture yet. It's a safe bet to assume that the PS4 will though, so there's still convergence going on at this end too. I don't see any strong evidence of opposite trends. Even the Wii U got a multi-core CPU and a unified GPU.

You see, I think putting relatively more focus on data regular throughput instead of latency sensitive performance is the right thing to do. CPU and GPU have to meet each other in the middle. There's tons of general purpose code out there where some of the major hotspots are in loops with independent iterations. It's faster and more power efficient to execute this using AVX2+ and downclocking the core. SMT is also a good thing, that's why we have Hyper-Threading. Long-running vector instructions would also break away from legacy CPU paradigms, to hide more latency and lower the scheduling cost. This isn't any "opposite" trend. It's convergence, which will result in unification.
 
Do you have any non-OpenCL benchmarks to prove that point? Or do you believe there's something about OpenCL that reduces Kepler's performance, something that doesn't apply to CUDA?
Here's a CUDA one where Kepler also loses against Fermi.

Of course there are benchmarks where Kepler is faster, but it's obvious from these mixed results that general-purpose throughput-oriented chips also have to care about latency-related optimizations, despite Kepler's clear superiority in theoretical throughput. And this is only going to get more pressing. In several years we could have chips with ten times higher theoretical computing power, but they won't have ten times more bandwidth. Instead a substantial improvement in on-die cache hit rates will be required, which requires larger caches with high data locality, which in turn demands advanced scheduling.

Hence throughput-oriented chips become more CPU-like, while at the same time CPUs are becoming more GPU-like by using wider SIMD units, FMA, gather, etc.
 