In absolute transistor counts, sure, but not relative to the transistor budget. A Core i7 measures 263 mm² and has four cores, while the P5 measures 294 mm² and has one core. And I'm pretty sure that i7's branch predictors combined don't take up as much area as a single P5 branch predictor.
It might be close.
Going by rough math (730M transistors / 263 mm² for the i7 and 3.1M / 294 mm² for the P5), the i7's transistor density works out to over 260 times that of the P5.
The P5's predictor held 256 entries, and that was the extent of its dynamic branch prediction capabilities.
I've not seen solid numbers for the i7, but RWT postulated 256-512 entries for the first-level predictor and anywhere from 2K to 8K for the second level.
That does not include the possible indirect branch target buffers, the loop detectors, or the return stack buffer, which also carries renaming logic.
For just the second-level BTB: dividing the 8K upper bound by the ~263 density factor gives ~31, meaning an 8K-entry table in the i7's process takes about as much area as a 31-entry table would in the P5's process.
And 8K is 32 times larger than 256, the size of the P5's predictor.
The lower end of the estimate is 2K, which is 1/4 the upper bound.
Luckily, we have 4 cores. Going by the large 2nd-level predictor alone, it's possibly close to even with the P5's predictor, or possibly several times smaller at the low end of the estimate.
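The scaling math works out like this in a few lines (a back-of-envelope sketch; the 2K-8K entry counts are RWT's guesses, and I'm assuming entry sizes are comparable between the two designs):

```python
# Back-of-envelope BTB area comparison, using the figures from this thread:
# Core i7: ~730M transistors in 263 mm^2; P5: ~3.1M in 294 mm^2.
density_i7 = 730e6 / 263
density_p5 = 3.1e6 / 294
scale = density_i7 / density_p5        # ~263x density advantage for the i7

p5_entries = 256                       # the P5's entire dynamic predictor
for entries in (2048, 8192):           # RWT's guess for the 2nd-level table
    p5_equiv = entries / scale         # area, in "P5-process entry" units
    frac = p5_equiv / p5_entries       # fraction of the P5 predictor, per core
    print(f"{entries:5d} entries -> ~{p5_equiv:4.1f} P5-equivalent entries, "
          f"4 cores = {4 * frac:.2f} of one P5 predictor")
```

At the 8K upper bound, the four second-level tables together land at roughly half of one P5 predictor's area, before counting the first-level tables and the other structures.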
It doesn't seem likely that Nehalem would lag K8, which already has 2K entries.
It's the same with x86 decoders. They haven't really gotten all that much bigger. And while Core i7 has 16 of them, P5 had only 2 (and Larrabee should have 64).
I don't know of any numbers that would hint at the overall growth like there are for things like BTBs.
I do know that not everything involved in superscalar decoding scales linearly, especially not for variable-length ISAs.
K8, as an older example, has a pre-decode stage needed to resolve instruction lengths in its 16-byte instruction fetch, which involves 16 parallel predecoders for a 3-issue architecture.
In addition, every byte in the instruction cache is accompanied by 3 extra bits of stored pre-decode information, an increase of more than a third in the number of bits per cache line.
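The storage cost is easy to tally (a sketch assuming a 64-byte instruction cache line and the 3 predecode bits per byte described above):

```python
# Predecode storage overhead in a K8-style instruction cache:
# every instruction byte carries 3 extra bits of predecode information.
line_bytes = 64                        # assumed cache line size
data_bits = line_bytes * 8             # 512 bits of instruction data
predecode_bits = line_bytes * 3        # 192 bits of predecode metadata
overhead = predecode_bits / data_bits  # 0.375, i.e. 37.5% more bits per line
print(f"{predecode_bits} extra bits per {data_bits}-bit line "
      f"({overhead:.1%} overhead)")
```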
I'm not sure P5 needed to bother with this.
The i7 also carries the additional burden of generating the micro-ops that come out of its decoders, and all the extra bit-work needed for that.
This didn't happen until the PPro, so both P5 and Larrabee are probably slimmer as a result.
I'm not sure. On the CPU side, Hyper-Threading helps keep the pipelines full after a branch. And on the GPU side, not speculating anything means they have to keep all intermediate results in registers and make sure they have plenty of other work to do.
SMT isn't the same as speculation. Speculation is going past branches, possible exceptions, and unresolved memory addresses, where the chip must be able to reverse or discard calculations.
Can you clarify what the exact difference is?
Any register value that is stored in a program is going to be stored by either a non-speculative or speculative processor.
Do you mean that a speculative processor tends to spread this out in time, while a non-speculative one does it all at once?
For developers it has become increasingly difficult to avoid stalls when latencies are so high. So even when theoretical efficiency should be excellent, in practice performance can collapse due to dependencies. Speculation is not as bad as it sounds when you have a 99% hit rate and it makes your register file smaller and avoids stalls...
As far as speculation goes, the i7 has a 16-stage pipeline and is 4-wide. That means up to 64 instructions can be decoded and issued in sustained fashion (peak is higher) before a branch is resolved.
The quoted error rate for OoO chips is about 30%. That's 30% of execution wasted, and that wasted execution is work the power-saving logic is fooled into running at peak.
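The in-flight numbers work out like this (a rough model; the 30% figure is the waste rate quoted above, not a measured Nehalem value):

```python
# How much work can be speculative, and how much gets thrown away:
depth = 16              # i7 pipeline stages before a branch resolves (roughly)
width = 4               # instructions decoded/issued per cycle
window = depth * width  # up to 64 instructions in flight, sustained
print(f"up to {window} instructions issued before a branch resolves")

waste = 0.30            # quoted fraction of execution wasted on bad paths
useful = 1 - waste
print(f"for every {useful:.0%} of useful work, {waste:.0%} is discarded "
      f"-- and still burns full power")
```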
Larrabee is far milder. It might have at most 10 instructions in flight, and the error rate is probably lower because it stalls much more often.
Larrabee also confines branch prediction to the scalar x86 side. There are no vector branch predictors, for example.
Wide non-speculative processors have their own problems, particularly with wide vectors. There, the execution is over-determined, but it is usually much clearer to the silicon where it can save power.