Software/CPU-based 3D Rendering

Uh, no. The reason drivers have issues is that rendering is a very tricky subject, and that drivers have a very wide scope of things they do.
I'm not disputing that rendering is a complex subject, but being limited to what various versions of drivers properly support is not helping. Various people have expressed the desire to bring the whole thing into user space. IHVs can still provide their 'driver', in the form of an ordinary library, and it can be supplemented with custom ones. It would enable developers to pick a version that works for them, and be certain that it works on the customer's systems. Larrabee was pretty close to achieving that very desirable goal.
Writing a good renderer is quite difficult, which is why most games these days don't have their own rendering engines - they use someone else's, and that's with APIs that do most of the hard stuff for you!
Writing a renderer is becoming increasingly similar to writing a compiler. But that doesn't mean that when a CPU-like chip takes over GPU-like functionality, everyone will be forced to write their own compiler. The CPU world has a thriving ecosystem of developers who deliver different pieces of the puzzle. Plenty of people like doing the "hard stuff", like writing compilers or renderers, so that others don't have to. It doesn't make developing for the CPU any harder than developing for the GPU. On the contrary: currently the software layer provided by the hardware manufacturer is a limitation and a source of complications.
The only thing a software renderer would change is that instead of ending up in buggy drivers, the bugs would find their way into the commercial engines instead, and since there are more commercial engines than hardware vendors, the same bugs would have to be fixed separately more times, so it would probably be worse. All you'd be doing is moving the difficult code from one platform onto another.
Not really. Today's situation is that application developers have to work around bugs (both in functionality and performance) of a whole range of driver versions. Heck, your application can break because of a driver update. If you're a small fish, like most software companies, then the IHVs won't bother fixing things in a hurry for you.
 
Nick said:
Once we can agree on that, we can start concluding what it might mean to the future of GPUs.
This is so far out in left field I really don't know what to say. It would appear your grasp on rationality is tenuous at best. It is certainly clear that there is no reasoning with you on this topic. I am not sure what you are trying to achieve with these posts, but I do hope things work out well for you.

Peace.
 
Isn't this just a repeat of what happened in the 90s? And by that I mean at first the CPU did the calculations, until dedicated 3D graphics hardware came along. Now things seem to be going back to the CPU doing most of the heavy lifting.
 
You could argue that with PhysX/OpenCL/DirectCompute the heavy lifting is going in the direction of the GPU
 
You could argue that with PhysX/OpenCL/DirectCompute the heavy lifting is going in the direction of the GPU
You could argue that a few years ago, but uptake of the above outside of graphics applications and marketing has not been... uhh... impressive. DirectCompute has been the most useful in practice, and it is pretty much exclusively used to just do slightly more general graphics applications.
 
GPUs have big raw floating point peaks, but the integer processing capacity of GPUs hasn't been discussed that much. And it's an area where CPUs still have an advantage. Kepler's 32-bit integer multiply is 1/6 rate, bit shifts are 1/6 rate. Fermi is slightly better (1/4 rate for integer mul & shift). Algorithms / structures used on GPU programs are getting more and more sophisticated (kd/oct/quad/binary/etc trees, hashing, various search structures, etc), and traversal of these structures is mostly integer processing.

AVX2 doubles the CPU integer processing capability (to full width 256 bit registers). AVX2 also makes SIMD integer processing more useful, because gather can now use SIMD register contents directly as memory offsets (instead of requiring eight separate register moves and load operations), so SIMD can be used efficiently for memory address calculation as well. CPU vector processing also supports lower precision integers (8 x 32 bit / 16 x 16 bit / 32 x 8 bit). GPUs on the other hand have "fixed width" SIMD (Nvidia has 32 lanes, AMD has 64). Many algorithms (image processing, image decompression, etc) do not need wider than 8/16 bit integers. CPUs can process these at 2x or 4x rate. GPU wastes cycles and perf/watt by using needlessly wide 32 bit integer registers/operations for all integer math.
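
To illustrate the address calculation point, here's a minimal sketch using AVX2 intrinsics (the function and variable names are hypothetical, not from any particular codebase): the per-lane offsets are computed entirely in SIMD registers and fed straight into the gather, with no scalar extraction step.

Code:
#include <immintrin.h>

// Gather one float per lane from a 2d table; the address math
// (offset = row * width + col) is done for all 8 lanes at once.
void gather8(const float* table, const int* row, const int* col, int width, float* out)
{
    __m256i r = _mm256_loadu_si256((const __m256i*)row);
    __m256i c = _mm256_loadu_si256((const __m256i*)col);
    __m256i offset = _mm256_add_epi32(_mm256_mullo_epi32(r, _mm256_set1_epi32(width)), c);
    // the gather uses the SIMD register contents directly as memory offsets
    __m256 v = _mm256_i32gather_ps(table, offset, sizeof(float));
    _mm256_storeu_ps(out, v);
}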

Not that it changes things too much, but GCN can actually do 24-bit integer ops (including integer MADD) at full speed. Although I don't think they're directly accessible through HLSL or GLSL. They also added a 4x1 SAD instruction that works with 32-bit colors.
 
It's worth noting that both GK110 and Tahiti have *global* memory bandwidth in the 280 GB/s range, which is roughly equal to the L2 bandwidth for Sandy Bridge for all cores combined (quad core), and far higher than the L3. Hence, the lack of large high level caches doesn't hurt GPUs nearly as much as you'd think.

It's interesting to note how much faster GDDR5 is compared to DDR3. Both the current gen 6-core i7 (LGA2011) and GTX 680 have 4 memory channels, yet somehow the 680 manages around 180 GB/s while the i7 gets 40 GB/s. This is a 4.5x difference. Does the latency optimization of the DDR3 really hurt it that much, or is there more in play? If it really *is* latency optimization related, this is very bad for latency sensitive processors like CPUs, since they won't be able to compete in memory bandwidth, and L2 caches are a poor substitute for equal performance RAM.
 
GPUs on the other hand have "fixed width" SIMD (Nvidia has 32 lanes, AMD has 64). Many algorithms (image processing, image decompression, etc) do not need wider than 8/16 bit integers. CPUs can process these at 2x or 4x rate. GPU wastes cycles and perf/watt by using needlessly wide 32 bit integer registers/operations for all integer math.

This is incorrect. Check the SIMD Video Instructions in the PTX manual. These operate on both 8 and 16 bit integers.
 
This is incorrect. Check the SIMD Video Instructions in the PTX manual. These operate on both 8 and 16 bit integers.
Chapter 5.2.2. of NVidia PTX ISA documentation states:

"The .u8 and .s8 types are restricted to ld, st, and cvt instructions. The ld and st
instructions also accept .b8 type. Byte-size integer load instructions zero- or signextended the value to the size of the destination register."

You cannot perform any ALU calculation on 8 bit integer types (only load/store/convert is available). There are no native 8 bit registers. 8 bit types are loaded to larger registers (zero- or sign-extended).

The Type Conversion chapter in the CUDA programming guide also states:

"Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for: Functions operating on variables of type char or short whose operands generally need to be converted to int". This pretty much means that CUDA generally processes most integer math in 32 bit registers.

The newest CUDA C programming guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput) includes CUDA 3.5 (Kepler K20). K20 64 bit float multiplication is 1/3 rate, but 32 bit integer multiplication is still only 1/6 rate. 32 bit integer shifts are faster (now 1/3 rate).
Not that it changes things too much, but GCN can actually do 24-bit integer ops (including integer MADD) at full speed. Although I don't think they're directly accessible through HLSL or GLSL. They also added a 4x1 SAD instruction that works with 32-bit colors.
Did they mention the throughput of the SAD instruction? K10/K20 have only 1/6 SAD rate (same as 32 bit integer multiply).

24-bit integer operations are usually good enough for graphics processing. Sometimes the 16M maximum integer value is a limitation, for example when calculating addresses into big linear (1d) memory resources, but often you can use 2d resources on GPUs and do the addressing with a pair of 24 bit integers (of course that requires twice as much integer math for address calculation).

You can also do a limited set of 24 bit integer operations in 32 bit floating point registers (as 32 bit floats have a 24 bit mantissa). This is a trick often used on current generation consoles (pre DX10, no native integer processing). As long as you ensure that there's no overflow or underflow, you can add/subtract/multiply 24 bit integers in the floating point registers. Shifting is of course possible as well (multiply by 2^n or 1/(2^n)). You need an additional floor instruction in the down shift case (this is safe, since float renormalization kicks out the same bits as the shifting would zero). For division, the underflow is a bigger problem (you lose the N highest bits, where N is the number of bits in the divisor), so it's only practical for small values (often constants). Divide requires a floor instruction as well. Bit masking and logical operations are not possible when using this floating point register hack (but if you know that some bits are clear, add can be used instead of or, so you can do bit packing).
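
Here's a rough sketch of that arithmetic as plain C++ for readability (on the consoles in question it would of course be shader code); the helper names are made up, and the caller is assumed to keep every value within the 24 bit range so the float mantissa holds it exactly.

Code:
#include <cmath>

// 24 bit integers stored in 32 bit floats: exact as long as results stay below 2^24.
float iadd(float a, float b) { return a + b; }              // add/subtract are exact (no overflow assumed)
float imul(float a, float b) { return a * b; }              // caller guarantees the product fits in 24 bits
float ishl(float a, int n)   { return a * float(1 << n); }  // left shift = multiply by 2^n
float ishr(float a, int n)   { return std::floor(a / float(1 << n)); } // right shift: scale down, floor drops the shifted-out bits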

On older (VLIW / vector+scalar) GPUs bit packing is actually pretty fast with a 3d dot product. For example packing a 565 integer color (three float registers containing 24 bit integers) into one register can be done with a single dot product: float result = dot(float3(2048, 32, 1), colorInt.rgb); Now the result contains a 16 bit integer that can, for example, be stored to a 16 bit integer render target (without any loss in precision). A similar trick can be used to bit pack the 2/3 bit indices of DXT blocks to 16 bit outputs (in GPU DXT compressors).
It's worth noting that both GK110 and Tahiti have *global* memory bandwidth in the 280 GB/s range, which is roughly equal to the L2 bandwidth for Sandy Bridge for all cores combined (quad core), and far higher than the L3. Hence, the lack of large high level caches doesn't hurt GPUs nearly as much as you'd think.
It's not exactly fair to compare a 4 core Sandy/Ivy Bridge to a GPU, since half of the CPU die is reserved for an integrated GPU. You should be comparing the 8 core parts instead, as they are pure CPUs with no integrated GPU sharing the space/power/heat budgets.

Haswell doubles the cache bandwidths. L2 bandwidth per core is 64 bytes per cycle. 8 cores at 4 GHz gives 64 bytes/cycle * 4 GHz * 8 = 2048 GB/s of L2 cache bandwidth. That's over 10x the GDDR5 bandwidth of Kepler (Geforce 680), and 7.3x that of Tahiti.
It's interesting to note how much faster GDDR5 is compared to DDR3. Both the current gen 6-core i7 (LGA2011) and GTX 680 have 4 memory channels, yet somehow the 680 manages around 180 GB/s while the i7 gets 40 GB/s. This is a 4.5x difference. Does the latency optimization of the DDR3 really hurt it that much, or is there more in play? If it really *is* latency optimization related, this is very bad for latency sensitive processors like CPUs, since they won't be able to compete in memory bandwidth, and L2 caches are a poor substitute for equal performance RAM.
I have tried to find more concrete information about this topic as well, but GDDR hasn't been used in commercially available PC/server CPUs. There's not much information about GDDR latencies in CPUs. Xbox 360 uses GDDR3 as its main memory, and it has 500+ cycle memory latency (source: http://forum.beyond3d.com/showthread.php?t=20930). Memory latencies of similarly clocked PC CPUs with DDR2/DDR3 are in the 150-250 cycle range. However, the PS3 PPU also has a similar (400+ cycle) memory latency, but uses XDR memory (http://research.scee.net/files/pres...ls_of_Object_Oriented_Programming_GCAP_09.pdf). It could be that the memory subsystems of these simple speed demon PPC cores hadn't been that well optimized for latency. Xeon Phi is the most recent CPU that uses GDDR5. Is there any official Xeon Phi documentation out yet (that would describe the expected memory latencies)?
 
Chapter 5.2.2. of NVidia PTX ISA documentation states:

"The .u8 and .s8 types are restricted to ld, st, and cvt instructions. The ld and st
instructions also accept .b8 type. Byte-size integer load instructions zero- or signextended the value to the size of the destination register."

You cannot perform any ALU calculation on 8 bit integer types (only load/store/convert is available). There are no native 8 bit registers. 8 bit types are loaded to larger registers (zero- or sign-extended).

The Type Conversion chapter in the CUDA programming guide also states:

"Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for: Functions operating on variables of type char or short whose operands generally need to be converted to int". This pretty much means that CUDA generally processes most integer math in 32 bit registers.

I'm looking at chapter 8.7.13, which lists all the SIMD video instructions - I see vadd, vsub, vavrg, vabsdiff, vmin, vmax, and vset, each with 8 and 16 bit variants. They all operate on ordinary 32 bit registers (hence SIMD), so they would be of type .u32 or whatever. I don't believe they're actually exposed in CUDA C, so you'd have to use inline ASM.

It's not exactly fair to compare a 4 core Sandy/Ivy Bridge to a GPU, since half of the CPU die is reserved for an integrated GPU. You should be comparing the 8 core parts instead, as they are pure CPUs with no integrated GPU sharing the space/power/heat budgets.

My benchmarks say otherwise: http://forums.anandtech.com/showthread.php?t=2235413

I get 528 GB/s for the 6-core extreme edition, which is a $1000 CPU. You could buy two HD7970s for that price, while the $300 LGA 2011 motherboard would cover a decent CPU+motherboard to go with them. This means that the GPUs actually give about the same global memory bandwidth/$ as the CPU has L2 bandwidth/$.

I have tried to find more concrete information about this topic as well, but GDDR hasn't been used in commercially available PC/server CPUs. There's not much information about GDDR latencies in CPUs. Xbox 360 uses GDDR3 as its main memory, and it has 500+ cycle memory latency (source: http://forum.beyond3d.com/showthread.php?t=20930). Memory latencies of similarly clocked PC CPUs with DDR2/DDR3 are in the 150-250 cycle range. However, the PS3 PPU also has a similar (400+ cycle) memory latency, but uses XDR memory (http://research.scee.net/files/prese...ng_GCAP_09.pdf). It could be that the memory subsystems of these simple speed demon PPC cores hadn't been that well optimized for latency. Xeon Phi is the most recent CPU that uses GDDR5. Is there any official Xeon Phi documentation out yet (that would describe the expected memory latencies)?

One partial reason is that DDR3 does 64 bit transfers, which, since the CPU has to send out a 64 bit address for each transfer, means only 1/2 of the bandwidth is actually used for data. GDDR5 uses 256 bit transfers, which with 64 bit addresses allows 4/5 of the total bandwidth to be data. I expect it would be a very difficult problem to try to only use, say, 40 address bits, since you'd run into nasty stride issues. Still, this doesn't come close to closing the gap in performance...

Honestly, I don't understand why DDR3 uses such small transfers - I mean, a cache line will be 128b or even 256b...
 
...6-core extreme edition, which is a $1000 CPU. You could buy two HD7970s for that price, while the $300 LGA 2011 motherboard would cover a decent CPU+motherboard to go with them. This means that the GPUs actually give about the same global memory bandwidth/$ as the CPU has L2 bandwidth/$.
I doubt the pure 8 core CPUs are that much more expensive to produce compared to the 4 core + HD 4000 models (as the die sizes are not that different).

Intel can price the high end models as high as they want, since they have no competition in the high end. Professional customers will pay for the extra performance. Gone are the days when almost every gamer bought the highest end CPU available. I remember building three Athlon 1400 MHz based computers (for my friends) when it was the fastest CPU around (it even beat all the Intel Xeon server chips). Now that the megahertz race is over, and the free lunch is over, we no longer get automatic performance scaling for software. Consumers no longer see that much benefit in high end CPUs. Most consumer applications (and current generation games) do not scale to more than 2-4 threads/cores. Most programmers are still producing code using old fashioned programming models (single threaded, or statically multithreaded without scalable DLP). This is the biggest problem for multicore CPUs. The performance difference isn't worth the money anymore (for consumer usage patterns).

If most programmers used better programming models that automatically scaled to any number of cores, we could see much higher demand for multicore CPUs, and that would drive prices down. Now Intel is mainly selling the 6-8 core CPUs for professional workstations and servers, and prices them accordingly. At least we get extra cores for the money... in the past all the extra we got (in Xeons) was support for ECC memory :)
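
As a rough illustration of what "scales to any number of cores" means in practice, here's a hypothetical parallel_for helper in C++ (real task schedulers such as TBB add work stealing and far better load balancing; this is just the bare idea): the loop is written once and spreads itself over however many hardware threads the machine has.

Code:
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical helper: run body(i) for i in [begin, end), split evenly across
// all available hardware threads instead of a hardcoded 2 or 4.
template <typename Body>
void parallel_for(size_t begin, size_t end, Body body)
{
    size_t workers = std::max(1u, std::thread::hardware_concurrency());
    size_t chunk = (end - begin + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (size_t w = 0; w < workers; ++w) {
        size_t lo = begin + w * chunk;
        size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([=] { for (size_t i = lo; i < hi; ++i) body(i); });
    }
    for (auto& t : pool) t.join();
}

Code written against this kind of interface (or a proper job system) benefits from a 6-8 core CPU without any per-core tuning.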
 
One partial reason is that DDR3 does 64 bit transfers, which, since the CPU has to send out a 64 bit address for each transfer, means only 1/2 of the bandwidth is actually used for data. GDDR5 uses 256 bit transfers, which with 64 bit addresses allows 4/5 of the total bandwidth to be data. I expect it would be a very difficult problem to try to only use, say, 40 address bits, since you'd run into nasty stride issues. Still, this doesn't come close to closing the gap in performance...

Honestly, I don't understand why DDR3 uses such small transfers - I mean, a cache line will be 128b or even 256b...

That's per chip. CPU cache lines are now almost universally 512b, and in a normal 64-bit DDR3 interface, one transfer will completely fill a single cache line. (8*8bit chips, 64 bit transfer each = 512 bits)

For random access, increasing the size of accesses no longer increases speed at all.
 
The reliance of GPUs on static partitioning of register files based on worst-case analysis of kernels that are known in advance is a significant weakness for irregular code compared to register renaming and L1$ spill/fill.

What modern PC GPU does static register file partitioning? AMD and Nvidia are dynamic per kernel. Or perhaps that is what you meant. But multiple kernels in flight can use different sized partitions. Register renaming and L1$ spill/fill come with power issues. Any solution is a tradeoff in the multi-dimensional implementation space.
 
What modern PC GPU does static register file partitioning? AMD and Nvidia are dynamic per kernel. Or perhaps that is what you meant. But multiple kernels in flight can use different sized partitions. Register renaming and L1$ spill/fill come with power issues. Any solution is a tradeoff in the multi-dimensional implementation space.
I think Andrew meant that you have to do static analysis on the kernel, and preallocate registers based on the worst case register allocation required anywhere in the whole kernel. This isn't that bad for simple graphics shaders, but if you are running shaders that are several thousand instructions long, and your thread count must be limited because of a single register usage peak in the shader, you are in pretty bad shape (having to find the peaks and optimize them by hand in order to not stall the code in other places).

The GPU could for example precalculate a register allocation for each section of code between barriers, and switch register allocations at the barriers. The GPU needs to run all threads (in a single thread block) to the barrier before it can continue, so this would be a good place to switch the register allocation without any problems. Of course this is static as well, but at least it should be better than having a single register allocation for a whole kernel (that might be thousands of instructions).
 
I doubt the pure 8 core CPUs are that much more expensive to produce compared to the 4 core + HD 4000 models (as the die sizes are not that different).

There's no 8-core IB so one can't really compare but SB-E weighs in at 435mm^2 which is over twice as large as SB (w/HD 3000) at 216mm^2.

IB spends proportionately more of its die area on IGP but still nowhere close to half, so I think IB-E will also be much larger than IB.

Of course, when looking at the viability of using CPUs for rendering you can't just compare the entire CPU to a GPU, because part of the CPU load will still be needed to perform the non-rendering tasks required by games.
 
There's no 8-core IB so one can't really compare but SB-E weighs in at 435mm^2 which is over twice as large as SB (w/HD 3000) at 216mm^2.

IB spends proportionately more of its die area on IGP but still nowhere close to half, so I think IB-E will also be much larger than IB.
IVB-E will be out next year according to Intel's roadmaps. The integrated graphics takes up less than 40% of the Ivy Bridge die (according to some die shots). So I agree with you that an 8 core IVB-E would likely be slightly larger than IVB, but the difference should be much smaller than it is with Sandy Bridge.

However it seems that we are getting up to 14 core IVB models: http://news.softpedia.com/news/Hasw...with-14-Cores-and-4-Channel-DDR4-292122.shtml . The dies of these chips will surely be massive (compared to consumer products). IVB-E will officially support DDR3-1866, improving the memory bandwidth to 59.712 GB/s. The available memory configurations for Haswell E/EX are still not known (but could be even better, as those 16-20 cores surely would need more than is currently available).
 
It's interesting to note how much faster GDDR5 is compared to DDR3. Both the current gen 6-core i7 (LGA2011) and GTX 680 have 4 memory channels, yet somehow the 680 manages around 180 GB/s while the i7 gets 40 GB/s.
That's not really an interesting comparison though as the GDDR5 is not only longer latency but it's consuming a hell of a lot more power. I've said this several times before but if you're going to make any architectural claims about what's better (for graphics or whatever) in the future, you have to normalize on power usage.

I get 528 GB/s for the 6-core extreme edition, which is a $1000 CPU. You could buy two HD7970s for that price, while the $300 LGA 2011 motherboard would cover a decent CPU+motherboard to go with them. This means that the GPUs actually give about the same global memory bandwidth/$ as the CPU has L2 bandwidth/$.
Right and as above, that's relevant if you are deciding to buy hardware today that needs that uncachable global bandwidth but fits into GPUs' comparatively tiny RAM sizes. If you're talking architecture long term though, the SKU prices are irrelevant and arbitrary... might as well be using Tesla prices for the GPUs to be equivalent to the SNB-E pricing.

What modern PC GPU does static register file partitioning? AMD and Nvidia are dynamic per kernel. Or perhaps that is what you meant.
Yes that's what I meant - ultimately we have to drop the concept that the GPU must know - before launch of a kernel - the exact register file requirements of the kernel, and thus by extension the entire set of potentially callable code for that kernel.

Register renaming and L1$ spill/fill come with power issues. Any solution is a tradeoff in the multi-dimensional implementation space.
No doubt, but IMHO it's one of the more crippling ones that helps to relegate current GPUs to fairly simple kernel problems rather than running more significant chunks of code, and forces software to do various expensive (power-wise) gymnastics to run more complicated algorithms. I just don't see GPUs expanding much beyond their current usage if we maintain the requirement that all code effectively has to get inlined and optimized as such.

Of course, when looking at the viability of using CPUs for rendering you can't just compare the entire CPU to a GPU, because part of the CPU load will still be needed to perform the non-rendering tasks required by games.
Right, and there's quite a nontrivial amount of work done on the CPU in graphics/GPU drivers as well (typically eats at least one entire thread), so that needs to be factored in as well.
 
You could argue that a few years ago, but uptake of the above outside of graphics applications and marketing has not been... uhh... impressive. DirectCompute has been the most useful in practice, and it is pretty much exclusively used to just do slightly more general graphics applications.

What's your take on game physics? Not much needed so CPU is fine? Or have the GPU physics efforts not panned out?
 
Did they mention the throughput of the SAD instruction? K10/K20 have only 1/6 SAD rate (same as 32 bit integer multiply).

It sounds like it's full rate. Here's what it says in the GCN whitepaper:

GCN also adds new media and image processing instructions in the vector ALUs. Specifically, there are two new instructions: a 4x1 sum-of-absolute-differences (SAD) and quad SAD that operate on 32-bit pixels with 8-bit colors. Using these new instructions, a Compute Unit can execute 64 SADs/cycle which translates into 256 operations per clock.
 
Yes that's what I meant - ultimately we have to drop the concept that the GPU must know - before launch of a kernel - the exact register file requirements of the kernel, and thus by extension the entire set of potentially callable code for that kernel.


No doubt, but IMHO it's one of the more crippling ones that helps to relegate current GPUs to fairly simple kernel problems rather than running more significant chunks of code, and forces software to do various expensive (power-wise) gymnastics to run more complicated algorithms. I just don't see GPUs expanding much beyond their current usage if we maintain the requirement that all code effectively has to get inlined and optimized as such.

Cuda has supported recursion and proper function calls since Fermi. Only the first gen Cuda cards forced inlining.

As for register allocation, CPUs are also static. For one thing, you can't address into a register, so the register number is hardcoded into the instruction. This is the difference between a register file and a memory scratchpad. As for register spill, GPUs handle it in exactly the same way as CPUs - by dumping things out to memory. For what it's worth, register renaming can't prevent register spilling.

That's not really an interesting comparison though as the GDDR5 is not only longer latency but it's consuming a hell of a lot more power. I've said this several times before but if you're going to make any architectural claims about what's better (for graphics or whatever) in the future, you have to normalize on power usage.

So why then are GPUs able to fit this extra power into their budget but CPUs are not? It would seem to me that, at least for the high end, CPUs would use the faster, more power hungry memory, assuming there wasn't some other issue preventing it.

As for power consumption, my numbers for a system with a high end CPU and GPU running at 100% load 24/7 work out to about $200 per year, so power consumption isn't really all that important, especially since even gamers aren't likely to pull 10% of this. Thus, the question is about performance, to the greatest extent possible without melting anything.

For mobile, the landscape is a bit different, but not in favor of CPUs - if you've ever looked at a die shot of a mobile APU (such as this A6X, used in the iPad 4: http://images.anandtech.com/reviews/tablets/apple/ipad4/a6x.jpg - that die is something like 170 mm^2), you'll notice that the die is dominated by special fixed function hardware. CPUs just aren't power efficient enough to do more load than they absolutely have to.

Right and as above, that's relevant if you are deciding to buy hardware today that needs that uncachable global bandwidth but fits into GPUs' comparatively tiny RAM sizes. If you're talking architecture long term though, the SKU prices are irrelevant and arbitrary... might as well be using Tesla prices for the GPUs to be equivalent to the SNB-E pricing.

Fermi and later can access memory across the PCIe bus, so if you really need it, it's there (though you'll only get maybe 12 GB/s for PCIe 3, iirc, and you have all sorts of latency and transfer size issues). Still, if you look at the VRAM as a 2 GB L3 cache, with about 4x the bandwidth of a CPU's L3, it looks pretty nice. I don't believe that there are many algorithms that use this amount of memory that you can't do in an out of core way?
 