It mispredicts 1% of the time. The 30% comes from a single-threaded core running down the wrong path for the length of the pipeline after a mispredicted branch.
That was my bad wording. It should have said that 30% of the work is wasted across both threads, which I interpret as meaning that on average about 1/3 of all ROB entries are not committed.
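To pin down what's being counted, here's a minimal sketch in Python with placeholder numbers of my own (not measurements): if a misprediction occurs every C committed instructions and flushes W wrong-path instructions from the window, the non-committed fraction is roughly W / (W + C).

# Placeholder figures, purely illustrative of the mechanism.
branch_density  = 0.20   # assume 1 in 5 instructions is a branch
mispredict_rate = 0.01   # "mispredicts 1% of the time" (of branches)
wrong_path_work = 64     # wrong-path instructions in flight when the branch resolves

committed_between = 1 / (branch_density * mispredict_rate)                 # ~500
wasted_fraction = wrong_path_work / (wrong_path_work + committed_between)
print(f"wasted ~= {wasted_fraction:.0%}")   # ~11% with these placeholders

Branchier code, higher misprediction rates, or a fuller window at resolution push that fraction up toward the kind of figure being discussed here.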
Anyway, what you appear to be missing is that the misprediction penalty is lower with SMT because each thread doesn't fetch as far ahead, while the total number of mispredictions stays the same.
Nehalem still fetches pretty far ahead, up to 64 instructions ahead per thread in SMT mode. The decode rate per thread would be about half, which I can see providing some benefit by buying time for a branch that only needs a small window to resolve.
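As a minimal sketch of that trade-off (figures assumed for illustration, not measured): splitting the window halves the worst-case wrong-path work flushed per misprediction, while the number of mispredictions stays the same.

# Assumed figures, illustration only.
window_single = 128   # entries visible to one thread without SMT
window_smt    = 64    # per-thread share in SMT mode ("up to 64 ahead")
mispredicts   = 10    # per some fixed amount of committed work; same either way

print(mispredicts * window_single)   # 1280 wrong-path slots flushed, worst case
print(mispredicts * window_smt)      # 640 per thread, worst case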
Speculation can even be zero: let's say you have two threads where consecutive branches are at least 64 instructions apart.
That seems like a pretty restrictive example, and not one a silicon designer can count on.
Then whenever a branch is encountered, the CPU can switch to the other thread to avoid executing speculative instructions.
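As a sketch of what that proposal looks like in the abstract (an illustration of the idea only, not how any shipping SMT processor schedules), in Python:

# Toy switch-on-branch scheduler over two made-up instruction streams.
streams = {
    "T0": ["add", "mul", "br", "load", "add", "br"],
    "T1": ["load", "add", "sub", "br", "mul", "add"],
}
pcs = {"T0": 0, "T1": 0}
current = "T0"
trace = []
while any(pcs[t] < len(streams[t]) for t in streams):
    if pcs[current] >= len(streams[current]):
        current = "T1" if current == "T0" else "T0"   # this stream is finished
        continue
    op = streams[current][pcs[current]]
    trace.append((current, op))
    pcs[current] += 1
    if op == "br":
        # Branch seen: yield to the other thread instead of fetching past it,
        # so nothing issued after the branch is speculative.
        current = "T1" if current == "T0" else "T0"
print(trace)

The idea in the quoted example is that with branches at least 64 instructions apart, the other thread always has enough non-speculative work to cover branch resolution.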
Which processors with SMT do that?
It's an awfully GPU-like thing to switch on a branch, if we also set aside for a moment that GPUs, from the POV of the warp or wavefront abstraction, actually switch on sets of 32 or 64 branches at a time.
Larrabee's raster method will branch in granularities of at least 16 lanes to start with (or rather, the fiber pretends to).
Switching at the branch also negates the point of branch prediction: putting the thread to sleep there means halting its instruction fetch, and keeping fetch going past branches is why we predict them in the first place.
Also, because the IPC isn't constantly 4, the situation is actually even better. So my expectation is that 4-way SMT should be sufficient to make branch misprediction no longer a significant issue. This is confirmed by research. There is no need to switch threads every cycle like GPUs do; that's a waste of context storage.
The paper indicates that SMT leaves the overall CPI insensitive to branch prediction accuracy (most of the figures are normalized to a YAGS scheme).
The discussion of a branch misprediction as a long-latency event was not particularly in-depth, and their treatment of the components of that penalty is unclear to me.
It's not particularly helpful in teasing out information on my point of emphasis: the amount of work an aggressively speculating OoO chip wastes.
I'd need baseline numbers that don't start with an SMT chip.
Going by the general observation of 30% overall wastage for an OoO chip (probably an oldish figure, but likely close), how much lower is the wastage with SMT?
Relative insensitivity is not interesting to me, if it means little change from an already wasteful baseline.
The register space of GPUs has grown ever since we started calling them GPUs. If you want to run generic code, it has to increase even further. Developers go to great lengths to ensure that their algorithms run in a single 'pass' to avoid wasting precious bandwidth. So more intermediate data has to stay on chip. Like I said before, running out of register space is disastrous for performance.
Register pressure is a constant threat; in some high-occupancy cases we might see a minuscule 8 registers per thread, which CPU coders will tell you is a disaster...
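To put a rough number on that (the figures below are hypothetical, chosen only to be in a plausible order of magnitude for a GPU multiprocessor), here's the basic occupancy arithmetic in Python:

# Hypothetical register file and thread limits; the point is the trade-off,
# not any particular chip's specs.
regfile_entries = 16384   # 32-bit registers shared by all resident threads
regs_per_thread = 8       # the "minuscule 8 registers" case
max_threads     = 1024    # hardware cap on resident threads

threads_that_fit = regfile_entries // regs_per_thread   # 2048
resident = min(max_threads, threads_that_fit)            # capped at 1024
print(resident)   # full occupancy, but only by starving each thread of registers

Raise regs_per_thread to 32 and only 512 threads fit, so occupancy halves; that's the sense in which running out of register space hurts.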
But the intermediate data only has to stay on chip for as long as the final result isn't ready. A GPU takes a comparatively long time to finish computing. Even if the next instruction in a thread could start the very next cycle, a GPU will first execute the current instruction for every other thread. So it has to keep a lot more data around. The solution is to adopt more CPU-like features...
There are problems with the way GPUs work, but I don't see anything wrong with simply having a lot of data on chip. It's actually preferable in many ways.
FQuake is not using the remaining 20% because of dynamic load balancing granularity and because primitive processing is single-threaded. So I think my math is accurate. Computationally intensive code reaches an IPC well above 1 on modern architectures.
That doesn't help the rest of the code that needs to be run, which silicon has no choice but to also cater to.
So you'd rather have a chip that is twice as large and idle half the time than a chip that has high utilization but uses a bit of speculation?
All else being equal, no.
The chip that's twice as large can handle an order of magnitude more peak resources, so it's not all equal.
A little speculation on 10 times as many units is a lot more expensive.
Utilization of total resources without knowing the total is not enough to make a judgement call.
Current CPU architectures are not optimal, but neither are current GPU architectures. Both have something to learn from each other. Larrabee's got most of it right, but it's not the end of the convergence.
Larrabee is a garbage desktop CPU. Assuming it hits 2.0 GHz, to most consumer applications it will appear as a 250 MHz Pentium.
One core and 1/4 threading for single-threaded apps, and single-issue for everything not using its vector instructions.
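For what it's worth, the arithmetic behind that figure, with the assumptions spelled out (in Python; counting the Pentium as dual-issue is my own assumption about the comparison point):

# Back-of-envelope for the "250 MHz Pentium" comparison; inputs are assumptions.
clock_ghz        = 2.0   # "assuming it hits 2.0 GHz"
threads_per_core = 4     # a single-threaded app gets 1/4 of one core's slots
issue_penalty    = 2     # scalar code is single-issue vs. a dual-issue Pentium

print(clock_ghz * 1000 / threads_per_core / issue_penalty)   # 250.0 MHz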
First you claim IPC is around 1, and now you're saying it's so high that SMT can't yield a 30% improvement?
Yes, if resource contention is a problem or there are no unused instruction slots, SMT can't manufacture extra performance.
Developers already cringe at the idea of having to split up workloads over 8 threads to get high performance out of a Core i7. So how could they possibly consider running things on a GPU with hundreds of threads, other than graphics?
I'm not sure they'd want to.
I don't particularly care if they don't apply it to anything other than ridiculously parallel tasks.
Scheduling long-running tasks is not efficient, and a lot of them don't work on parallel data. Larrabee will be far better at running a wide variety of workloads, and CPUs are increasing the number of cores and widening the SIMD units to catch up...
Larrabee will be far better than GPUs at certain loads, yes.
Better than crap isn't necessarily good.