Dynamic Branching Granularity

Nick

Veteran
Hi all,

Can anyone explain to me why branching granularity isn't one quad? I don't see where the complexity comes from. In fact I would have expected it to be easier to track branching at the per-quad level.

Also, what will change about branching in the future with unified architectures? Will pipelines be capable of independent branching for processing vertices individually, and be configured in quads for pixel processing?

Thanks,

Nick
 
The size of a batch job has a direct impact on the complexity of the job scheduler.

Example:

If you have 48 ALUs/FPUs and you want a batch size of 4 pixels, your scheduler must be able to issue 48 jobs per clock. Double the size of the job and you cut the scheduler's work in half. Less work means fewer transistors, and that saves die space.

In the case of nVidia things are even more complex. They don’t use a job scheduler that dynamically assigns jobs to the ALUs/FPUs. It’s more like a large ring buffer that contains data and control code at the same time. Every block of pixels starts with at least one control token. This control token configures the ALUs/FPUs for the pixels that follow. During this configuration the ALUs/FPUs don’t process pixels, which means that every time a control token is executed you lose a clock. If you now make the batches smaller you will need more control tokens, and performance drops in non-branching cases.
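As a rough illustration of that overhead, here's a tiny back-of-the-envelope sketch; it assumes one control token per batch costing one clock, and one clock of useful work per pixel per pipeline (simplified numbers, not actual hardware figures):

```python
# Fraction of clocks spent on useful shading work if every batch starts with a
# control token that costs one clock (simplified model of the description above).
def useful_clock_fraction(batch_pixels, token_clocks=1):
    work_clocks = batch_pixels              # one clock of shading per pixel (per pipeline)
    return work_clocks / (work_clocks + token_clocks)

for batch in (4, 16, 64, 1024):
    print(f"batch = {batch:4d} pixels -> {useful_clock_fraction(batch):.1%} useful clocks")
```

Smaller batches mean more tokens per pixel, so the useful fraction drops even when no branching happens at all.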

I don’t think that we will see fully independent vertex branching in a unified environment.
 
Thanks for the information Demirug!

Unfortunately I think I still don't grasp the basic concepts. As far as I know, dynamic branching in pixel shaders requires executing the shader once, checking which pixels in the quad should have taken different branches, and repeating the shader as long as there are pixels that were forced to take the incorrect branch (keeping track of which pixels have already been computed correctly and selecting which one is the next 'master' to determine which branch the whole quad takes). This should guarantee that the shader is executed only as many times as necessary to compute all pixels in a quad correctly (i.e. one to four times).

When using a bigger granularity, say 4x4 pixels for an architecture with 4 quads, the shader potentially has to be executed 16 times, and more branching state has to be tracked. I do realize it's easier to have 16 pipelines run the same shader instance (i.e. they execute the same instruction at the same time). Is that the decisive factor? Do they assume high correlation between the branches taken in neighboring pixels (and what about the future)? Can ATI hardware only run 3 shader instances (I assume this is what you call jobs), with 4 quads coupled to the same scheduler?

For NVIDIA hardware, is each quad pipeline capable of executing a different shader instance, but there's a significant overhead for starting/changing them? How did they improve granularity for G70 without affecting performance then?

What's the direction they both 'should' be taking for the next generation?

My whole understanding of GPU architecture is on a very abstract level so please bear with me... I'd like to know more about it to better understand the possibilities and limitations of current and future hardware. I'm very excited about Direct3D 10 for doing GPGPU stuff, but if incoherent branching remains inefficient then that rules out some applications (e.g. real-time raytracing). Future CPUs will have wider SIMD units, dual/quad/octa cores, efficient branching, and independent scheduling that might even cross execution units (cfr. Reverse Hyper-Threading). So will GPUs remain really interesting only for rasterization, or will they be able to compete with increasingly parallel CPUs even on branchy code?
 
Dynamic branching on a GPU works like this.

Until the shader reaches the first branch instruction there is nothing special. But when the branch instruction executes, it generates some kind of bit mask that records, for every pixel in the job, which path it has to take. If all pixels take the same path, the GPU simply jumps to the first instruction of that path and continues execution. If the job contains pixels from both paths, it first executes one of them and then, after reaching the matching end-branch instruction, the other one. During the execution of both paths, additional logic blocks the shader unit from writing results to the registers of those pixels that should not execute the current path. Overall this means that, independent of the job size, the maximum number of instructions that needs to be executed is always equal to the sum of the instructions in both branch paths.
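A toy sketch of that mechanism, assuming a single if/else and a per-pixel write mask; the real hardware encoding is of course different, and register writes are masked rather than done in separate loops:

```python
# Toy model of predicated branch execution over one batch ("job"): a bit mask
# records which path each pixel wants, both paths are walked when they disagree,
# and writes are suppressed for pixels not on the current path.
def run_branch(batch, condition, then_path, else_path):
    take_then = [condition(px) for px in batch]        # the per-pixel "bit mask"
    if all(take_then):                                 # coherent: only one path executes
        return [then_path(px) for px in batch]
    if not any(take_then):
        return [else_path(px) for px in batch]
    results = [None] * len(batch)                      # divergent: execute both paths
    for i, px in enumerate(batch):                     # pass 1: 'then' path, masked writes
        if take_then[i]:
            results[i] = then_path(px)
    for i, px in enumerate(batch):                     # pass 2: 'else' path, masked writes
        if not take_then[i]:
            results[i] = else_path(px)
    return results

# Worst case is then-path + else-path instructions per pixel, regardless of job size.
print(run_branch([1, 2, 3, 4], lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x))
# -> [-1, 20, -3, 40]
```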

Normally each shader unit is fully independent of the other units. This makes it easier to disable one if there is an error.

The main advantage of GPUs over CPUs, even in branching code, is the higher memory bandwidth and the ability to compensate for memory latency. Think of a kind of hyper-threaded CPU that runs hundreds of threads and can switch quickly from one to another when one stalls while waiting for data from memory.
 
Demirug said:
I don’t think that we will see fully independent vertex branching in a unified environment.

You lost me there. I didn't get why this would follow from your explanation. A few more words on why not, and how much of an ouchie it is?
 
I don't know if it's exactly what Demi is talking about, but bear in mind that with current NVIDIA processors each vertex shader is independent of the others, so the branching granularity there is small; in a unified environment vertex data is processed over the ALUs in the same manner as pixel data, so they operate at the same granularity (i.e. Xenos's batch size is 64 pixels, which means 64 vertices as well).
 
You need to think of current GPUs as consisting of pixel pipelines that are able to hide the somewhat random latency induced by texture mapping.

The traditional approach is to make the pipeline really long (e.g. 220 clocks in NV4x and G70) so that the time taken to issue a texture mapping request and get back a result can be hidden.

A pipeline consists of stages required to:
  • fetch the operands for an instruction (or for the co-issued instructions)
  • swizzling and organisation of operands for co-issue
  • execute the instruction(s) (or request a texture and, optionally, complete the instruction)
  • write the results of the instruction(s) back to the register file
Requesting a texture requires:
  • calculation of the required texels' addresses
  • fetching the texels (which may require a fetch from memory, or they may be in cache)
  • a "wait" period to allow the texels time to be fetched
  • filtering
  • write the results back to the register file (and/or provide them back to the pixel pipeline for immediate processing)
So a requested texture comes back "just in time" for the pipeline to use it.

The pipeline's overall length is the sum of these stages, with the "texel fetch wait" period being designed (presumably) as some kind of "typical" worst-case. More complex texture mapping (multi-texturing and/or trilinear/anisotropic filtering) requires multiple extra fetches and filtering steps to be performed, beyond the limits of the "worst-case wait". This is where things get really fuzzy for me, but it doesn't really affect the point of what I'm saying.

A pipeline normally processes 4 pixels at a time, because this makes for nice coherent accesses to memory to read texels and it makes for nice coherent computation of bilinear (or better) filtering. So right there you get the basic unit of a "batch": 4 pixels and 220 stages equals 880 pixels.

When a GPU is sized-up to process 16 or 24 pixels at a time, you can simply multiply the quads that are all running the same instruction in parallel, from the 1-quad original up to 6, say. 6 quads would make a batch size of 5280 pixels. NVidia actually only went as far as 4 quads though, in NV40-45.
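A quick sanity check of those numbers, treating a batch simply as (quads in lock-step) x (4 pixels per quad) x (pipeline depth in clocks):

```python
# Batch size for a long lock-stepped pipeline: pixels issued per clock x pipeline depth.
def batch_size(quads_in_lockstep, pipeline_clocks, pixels_per_quad=4):
    return quads_in_lockstep * pixels_per_quad * pipeline_clocks

print(batch_size(1, 220))   # 880  pixels: a single quad on a 220-stage pipeline
print(batch_size(6, 220))   # 5280 pixels: hypothetical 6 quads in lock-step
print(batch_size(4, 220))   # 3520 pixels: NV40-45 style, 4 quads in lock-step
```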

In NV47 (G70) NVidia made each of the quads independent of the others. The primary effect is that texturing in each quad can run "out of step" with its neighbours. There's a patent about issuing quads out of step with each other; the effect is to even out the load on memory and reduce the worst-case latencies when texturing. It also reduces the size of a batch, which obviously makes dynamic branching more granular (as compared with NV45, say, where a batch consists of 4 quads in lock-step, 3520 pixels).

ATI designed its pixel pipeline a bit differently. The idea is that a lot of the time a texture fetch isn't needed immediately by the pixel pipeline - instead in two, three or more instructions' time. So texturing is performed "asynchronously" and the pixel pipeline tries to continue to process succeeding instructions while the texture results are being produced. It doesn't always work out, which is when you get stalls.

Now the size of a batch can be smaller (e.g. 64 quads = 256 pixels) because the pipeline only needs to be long enough for non-texturing work - it's now an ALU-only pipeline, in effect. The latency-hiding "wait" stages are no longer part of the overall pipeline length, instead waiting is the responsibility of the separate TEX pipeline. (The size of a batch in ATI's R3xx GPUs was fixed, seemingly at 256, but the size of a batch in R4xx GPUs can be less or more.)

In R5xx, ATI changed the architecture so that the pixel pipeline no longer works on a single batch until all the instructions of the shader are executed (ALU or TEX). Texturing latency is now hidden not just by hopefully executing two or three succeeding instructions whilst the texture operation is performed, but by executing instructions for other pixels that aren't even in the same batch.

The problem now is how short can you make the ALU pipeline? You still have to spend cycles fetching from the register file, organising co-issue etc. With the pipeline lengthened by simultaneously working on multiple batches, it's now a matter of how many batches can be supported. Each batch requires separate instruction decode and each batch will require a different fetch/store access-pattern in the register file (which means increased latency). So as each batch is added to the design, the complexity of the pipeline increases - extra transistors.

Additionally it's difficult to have the pipeline work on different batches on each succeeding clock. So ATI has settled on 4 clocks per batch. In R520 and RV515 this means 16 pixels in a batch, 4 clocks per quad. In R580 and RV530 three quads are processed by a pipeline simultaneously, so you have 4 clocks x 12 pixels = 48 pixels in a batch (though the texturing pipeline is 1 quad wide, so the 3 ALU quads have to take it in turns to request texturing). That's the result of ATI's desire to create an architecture where 3 ALU instructions are processed for each TEX instruction.
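The same arithmetic for the ATI-style short pipeline, where a batch is (ALU quads working together) x (4 pixels per quad) x (clocks spent per batch):

```python
# ATI-style batch size: ALU quads x 4 pixels per quad x clocks per batch.
def ati_batch_size(alu_quads, clocks_per_batch=4, pixels_per_quad=4):
    return alu_quads * pixels_per_quad * clocks_per_batch

print(ati_batch_size(1))   # 16 pixels: R520 / RV515 (one quad, 4 clocks per batch)
print(ati_batch_size(3))   # 48 pixels: R580 / RV530 (three ALU quads sharing one TEX quad)
```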

Ultimately it's a question of spending transistors. Currently it doesn't make sense to texture in units smaller than 1 quad (coherency is lost - smaller TMUs amounting to the same overall texturing capability will incur more latency and use more transistors), so that determines the minimum width of a pixel pipeline. Then you have the turn-around time for a pipeline: the minimum number of stages in which operands can be fetched, organised, executed-upon and stored versus the number of batches you're willing to put into the pipeline. As you increase the number of batches to reduce the number of pixels per batch, you create overheads in terms of supporting these multiple contexts.

Theoretically you could create a pipeline that only spends 1 clock on each batch (instead of the current 4 in R5xx) but the complexity of the pipeline would be immense. And you'd also have to increase texture cache sizing and complexity to account for an increase in the turnover of batches requesting textures, hence much lower cache coherency. Though this should be mitigated by the batches, themselves, being coherently scheduled (at least some of the time).

So, overall, batch size is a compromise of hiding the latency of texture mapping versus the complexity of pipeline design to support multiple batches. The support for dynamic branching in R5xx and Xenos comes directly out of the ability to support multiple batches per pipeline - the granularity of branching is really down to the overheads incurred in supporting multiple batches.

Jawed
 
Demirug said:
Until the shader reaches the first branch instruction there is nothing special. But when the branch instruction executes, it generates some kind of bit mask that records, for every pixel in the job, which path it has to take. If all pixels take the same path, the GPU simply jumps to the first instruction of that path and continues execution. If the job contains pixels from both paths, it first executes one of them and then, after reaching the matching end-branch instruction, the other one. During the execution of both paths, additional logic blocks the shader unit from writing results to the registers of those pixels that should not execute the current path. Overall this means that, independent of the job size, the maximum number of instructions that needs to be executed is always equal to the sum of the instructions in both branch paths.
Interesting! I had the impression that the whole shader was executed again for each pixel taking different branches. I now realize that's not necessary. For nested branches I assume then there's some kind of stack to keep track of where to return and what still needs to be executed, for which pixels? Is this the task of the control tokens used by NVIDIA?
The main advantage of GPUs over CPUs, even in branching code, is the higher memory bandwidth and the ability to compensate for memory latency. Think of a kind of hyper-threaded CPU that runs hundreds of threads and can switch quickly from one to another when one stalls while waiting for data from memory.
With multi-core they also started to realize the need for high memory bandwidth for the CPU. DDR3 and XDR are already on the roadmaps for quad/octa-core. So I believe that in a couple years memory technology for GPUs and CPUs will converge again. Also don't underestimate a CPU's caches. Core 2 Duo will already have two 2 MB L2 caches. Lots of data for say real-time raytracing can reside there while a GPU would have to use precious RAM bandwidth for that with higher latencies. Unless they also start using large caches. It's impossible to hide latencies of dozens of memory accesses through threading. In fact you'd need a cache-like structure to store all thread register sets. :) And that would only 'solve' the latency problem. Core 2 Duo has for each core two 128-bit busses to the cache, running at around 3 GHz. That's 1.5 Tb/s bandwidth. Anyway, I expect CPUs will also re-introduce Hyper-Threading and Inverse Hyper-Threading, but just two threads per core. So if GPUs don't get highly efficient low granularity branching then I'm not sure if they can be used for much more than advanced rasterization.
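For what it's worth, a quick check of that cache-bandwidth figure under the assumptions in the paragraph above (two 128-bit core-to-cache buses per core, two cores, roughly 3 GHz):

```python
# Back-of-the-envelope cache bandwidth: cores x buses x bus width x clock.
cores, buses_per_core, bus_bits, clock_hz = 2, 2, 128, 3e9
bits_per_second = cores * buses_per_core * bus_bits * clock_hz
print(f"{bits_per_second / 1e12:.2f} Tbit/s")      # ~1.54 Tbit/s total
print(f"{bits_per_second / 8 / 1e9:.0f} GB/s")     # ~192 GB/s total
```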
 
Jawed: You sure you have your batch size correct for NV4x? Where'd you get that 220 cycle count from anyway?
 
Dave Baumann said:
I don't know if it's exactly what Demi is talking about, but bear in mind that with current NVIDIA processors each vertex shader is independent of the others, so the branching granularity there is small; in a unified environment vertex data is processed over the ALUs in the same manner as pixel data, so they operate at the same granularity (i.e. Xenos's batch size is 64 pixels, which means 64 vertices as well).

That’s exactly what I mean.
 
Can anyone explain to me why branching granularity isn't one quad?
There are lots of issues with branching on a per-quad granularity:
- You'd need to replicate the program counter, primitive id (for fetching interpolated attributes), branch state, stack, and misc logic per quad, instead of per batch. For a 16x16 batch, that implies 64x more hardware (a rough count is sketched after this list).
- You now need to make the scheduler much larger so that it can schedule 1 quad/clock/ALU instead of 1 batch/ALU in multiple clocks. Resolving dependencies, routing data and registers, etc gets really complicated (== more area) when you need to do this at very fine granularity.
- It's difficult to make large register files that have 3 read ports/1 write port (for MADs), no bank conflicts and can do random register accesses per clock per quad without sacrificing a huge amount of area.
- Your texture cache efficiency falls through the floor as now potentially random quads will flow through it. Synchronizing quads into larger batches (equivalent to sorting quads based on program counter) is even more hardware expense to try and recover some cache goodness after divergent branches re-join.
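A rough count of the replication implied by the first point, for a hypothetical 16x16-pixel batch:

```python
# How many copies of the per-batch control state (program counter, branch mask,
# stack, etc.) you need if branching is tracked per N pixels instead of per batch.
def contexts_needed(batch_width, batch_height, granularity_pixels):
    return (batch_width * batch_height) // granularity_pixels

print(contexts_needed(16, 16, 16 * 16))   # 1  context per 16x16 batch
print(contexts_needed(16, 16, 4))         # 64 contexts if tracked per quad -> 64x the hardware
```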

I'm sure there are many other issues. I just can't think of them right now.

Suffice to say, I don't think we'll see quad-level branching in the near future. Even R580 is moving away from that goal by tripling the batch size from R520.
 
Nick said:
Interesting! I had the impression that the whole shader was executed again for each pixel taking different branches. I now realize that's not necessary. For nested branches I assume then there's some kind of stack to keep track of where to return and what still needs to be executed, for which pixels? Is this the task of the control tokens used by NVIDIA?
Pretty much. The hardware maintains info for nested branches where it has 'flattened' the branches out one sub-level deep, so it knows the execution path for that nested branch segment and the next branch level (as far as I know and understand). It uses the token information to determine the state of the current branch execution and goes on from there into new branch levels, masking pixels as Demi explains, so you aren't processing pixels needlessly.
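Since the actual token scheme isn't documented, here's only a generic illustration of how nested divergence can be tracked with a small stack of per-pixel write masks; with the bounded nesting depth Demirug mentions below, the same bookkeeping can live in a fixed set of mask registers rather than a true stack. This is not NVIDIA's mechanism, just one possible model:

```python
# Generic mask-stack model for nested branches: each nested 'if' ANDs a new
# condition into the active mask; 'endif' pops back to the enclosing mask.
class MaskStack:
    def __init__(self, n_pixels):
        self.stack = [[True] * n_pixels]               # outermost level: all pixels active

    def push_if(self, cond):                           # entering an 'if': AND the condition in
        top = self.stack[-1]
        self.stack.append([a and c for a, c in zip(top, cond)])

    def flip_else(self):                               # entering the matching 'else'
        enclosing, taken = self.stack[-2], self.stack[-1]
        self.stack[-1] = [e and not t for e, t in zip(enclosing, taken)]

    def pop_endif(self):                               # leaving the 'if': restore enclosing mask
        self.stack.pop()

    def active(self):                                  # pixels allowed to write results right now
        return self.stack[-1]

ms = MaskStack(4)
ms.push_if([True, False, True, False])                 # outer if
ms.push_if([True, True, False, False])                 # nested if
print(ms.active())                                     # [True, False, False, False]
ms.pop_endif()                                         # end of nested if
ms.flip_else()                                         # outer else
print(ms.active())                                     # [False, True, False, True]
ms.pop_endif()
```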
 
Nick said:
Interesting! I had the impression that the whole shader was executed again for each pixel taking different branches. I now realize that's not necessary. For nested branches I assume then there's some kind of stack to keep track of where to return and what still needs to be executed, for which pixels? Is this the task of the control tokens used by NVIDIA?

As there is only a limited nesting depth possible, you don't need a stack for this. The whole command token system that nVidia uses is not documented at all.

Nick said:
With multi-core they also started to realize the need for high memory bandwidth for the CPU. DDR3 and XDR are already on the roadmaps for quad/octa-core. So I believe that in a couple years memory technology for GPUs and CPUs will converge again. Also don't underestimate a CPU's caches. Core 2 Duo will already have two 2 MB L2 caches. Lots of data for say real-time raytracing can reside there while a GPU would have to use precious RAM bandwidth for that with higher latencies. Unless they also start using large caches. It's impossible to hide latencies of dozens of memory accesses through threading. In fact you'd need a cache-like structure to store all thread register sets. :) And that would only 'solve' the latency problem. Core 2 Duo has for each core two 128-bit busses to the cache, running at around 3 GHz. That's 1.5 Tb/s bandwidth. Anyway, I expect CPUs will also re-introduce Hyper-Threading and Inverse Hyper-Threading, but just two threads per core. So if GPUs don't get highly efficient low granularity branching then I'm not sure if they can be used for much more than advanced rasterization.

Believe me, with hundreds of threads you can hide the memory latency completely, as long as you have enough bandwidth. The high number of threads even makes it possible to build long pipelines that can compensate for additional cycles.

I know you are in the software rendering business and maybe this makes you a little bit CPU-biased. But even though CPUs don't execute branches in the same way as GPUs, they only run at maximum speed if the branch prediction can detect a pattern. If you take a look at IA64 you will see that it doesn't like branching either. The general rule there is to calculate both paths and finally write back only the result from one.

A GPU will never be a CPU, but in the future there will be more than just rendering problems that they solve well.
 
Jawed said:
The traditional approach is to make the pipeline really long (e.g. 220 clocks in NV4x and G70) so that the time taken to issue a texture mapping request and get back a result can be hidden.
Is that the texture sampler pipeline or shader pipeline? I assume it's the sampler pipeline because shaders really need the shortest latency possible for arithmetic instructions.
A pipeline consists of stages required to:
  • fetch the operands for an instruction (or for the co-issued instructions)
  • swizzling and organisation of operands for co-issue
  • execute the instruction(s) (or request a texture and, optionally, complete the instruction)
  • write the results of the instruction(s) back to the register file
Such a shader pipeline can't be very long, right? There should be only one, or very few, execution stages, so the next instruction can use the result without waiting many (if any) clock cycles. GPUs don't have out-of-order execution, do they?
Requesting a texture requires:
  • calculation of the required texels' addresses
  • fetching the texels (which may require a fetch from memory, or they may be in cache)
  • a "wait" period to allow the texels time to be fetched
  • filtering
  • write the results back to the register file (and/or provide them back to the pixel pipeline for immediate processing)
So a requested texture comes back "just in time" for the pipeline to use it.
I can understand this sampler pipeline would take 220 clock cycles.

Is the 'wait' operation a pipeline consisting of just latches passing data to the next stage without any logic in between? So no matter whether the data is in the cache or has to be fetched from RAM, the latency is the same (with the cache only reducing bandwidth)?
The pipeline's overall length is the sum of these stages, with the "texel fetch wait" period being designed (presumably) as some kind of "typical" worst-case. More complex texture mapping (multi-texturing and/or trilinear/anisotropic filtering) requires multiple extra fetches and filtering steps to be performed, beyond the limits of the "worst-case wait". This is where things get really fuzzy for me, but it doesn't really affect the point of what I'm saying.
Why would the 'overall' pipeline length be the sum of sampler and shader unit stages? That only makes sense for Direct3D 7 class fixed-function processing where texture coordinates are fixed and all textures can be sampled before doing any arithmetic operations. With programmable pixel processing a texture can be sampled at any time and the coordinates are not known in advance.
A pipeline normally processes 4 pixels at a time, because this makes for nice coherent accesses to memory to read texels and it makes for nice coherent computation of bilinear (or better) filtering. So right there you get the basic unit of a "batch": 4 pixels and 220 stages equals 880 pixels.
Quads are needed for texture gradients (cfr. the dsx/dsy instructions), which in turn are needed for mipmapping.
When a GPU is sized-up to process 16 or 24 pixels at a time, you can simply multiply the quads that are all running the same instruction in parallel, from the 1-quad original up to 6, say. 6 quads would make a batch size of 5280 pixels. NVidia actually only went as far as 4 quads though, in NV40-45.
So... Are you saying that it executes the same shader instruction for a batch of quads sequentially, before it continues to the next shader instruction? That would allow longer execution latencies, but require massive amounts of memory for storing temporary registers for each pixel in the batch. And what happens with tiny polygons?

What am I missing?
ATI designed its pixel pipeline a bit differently. The idea is that a lot of the time a texture fetch isn't needed immediately by the pixel pipeline - instead in two, three or more instructions' time. So texturing is performed "asynchronously" and the pixel pipeline tries to continue to process succeeding instructions while the texture results are being produced. It doesn't always work out, which is when you get stalls.
The way I understand it, ATI's Ultra-Threading is very similar to Intel's Hyper-Threading. It hides latencies (mainly from texture sampling) by scheduling whichever instruction from a group of shader threads is ready to execute. Still in-order execution though. I think that's what you meant, but please correct me if I'm wrong.
Now the size of a batch can be smaller (e.g. 64 quads = 256 pixels) because the pipeline only needs to be long enough for non-texturing work - it's now an ALU-only pipeline, in effect. The latency-hiding "wait" stages are no longer part of the overall pipeline length, instead waiting is the responsibility of the separate TEX pipeline. (The size of a batch in ATI's R3xx GPUs was fixed, seemingly at 256, but the size of a batch in R4xx GPUs can be less or more.)

...
The Ultra-Threading approach makes sense to me, but having texture sampling in the same pipeline still sounds very odd for a programmable pipeline.

Obviously there are still huge gaps in my understanding of GPU architectures at the hardware level... It's becoming clear that running GPGPU applications optimally is a daunting task and Direct3D 10 isn't going to revolutionize that much.

Thanks a lot Jawed!
 
Nick said:
Such a shader pipeline can't be very long, right? There should be only one, or very few, execution stages, so the next instruction can use the result without waiting many (if any) clock cycles. GPUs don't have out-of-order execution, do they?
Let's do a little mental exercise. What if your ALU pipeline was 1000 cycles long? How would you feed it? If you treat it like an ordinary CPU pipeline, then dependent consecutive instructions will take a 1000-cycle hit because they need to wait for the result of the previous operation to be available before starting the next one.

What if instead of waiting 1000 cycles for those results, you just issue an ALU instruction from some other quad? That quad is guaranteed independent, so it can be scheduled on the ALU.

Ok, that second quad fills up 1 of the 1000 cycles of the ALU length. If you add the initial quad you ran, that's 2 out of 1000 cycles.

So what's next? Well that's simple: Just fill up the ALU pipeline with 1000 different quads. Back-to-back instruction dependencies will not have any stalls. Total latency goes up but the throughput is maintained at 1 quad/clock.

Now, replace "ALU" with "Shader pipeline" in the above paragraphs and you can imagine what happens on the GPU.
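A toy model of this exercise, assuming one instruction can be issued per clock and each quad's next instruction depends on the result of its previous one:

```python
# Cycles needed to issue N dependent instructions per quad on a deep pipeline,
# when independent quads are interleaved round-robin to cover the latency.
def cycles_to_issue(instructions_per_quad, num_quads, pipe_depth):
    if num_quads >= pipe_depth:
        return instructions_per_quad * num_quads       # one issue every cycle, no stalls
    stall = pipe_depth - num_quads                     # idle cycles in each round
    return instructions_per_quad * (num_quads + stall)

print(cycles_to_issue(10, 1000, 1000))   # 10000 cycles for 10000 issues -> 1 quad/clock throughput
print(cycles_to_issue(10, 1, 1000))      # 10000 cycles for just 10 issues -> almost all stalls
```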

Nick said:
Why would the 'overall' pipeline length be the sum of sampler and shader unit stages?
That's the total thread latency. This is what you need to cover with that thread and other shader threads. If your shader takes 20 clocks through ALUs and 100 clocks through texturing, you need to keep its registers around for 120 clocks. Hence, to run at full speed, you need 120 threads to run serially. Notice that this rule is independent of the actual implementation: ALUs and texture units can be separate, or combined, or there could be multiple ALUs or whatever. What matters here is the total latency of the system.

Nick said:
That would allow longer execution latencies, but require massive amounts of memory for storing temporary registers for each pixel in the batch. And what happens with tiny polygons?
Yes, it does require more memory. If you take the ~1k thread/quad pipe number being thrown around for NV4x, then you'll need 1024 * #registers/thread worth of space to run at full-speed. For 128-bit registers, that's 16KB * #registers/thread.

You can then pick a register file size, like 32 KB. 32 KB lets you have 2 registers/thread and still run at full speed. If your thread needs more than 2 registers, then you run at (RF_size / num_registers / register_size) / batch_size of full speed. For example, if you need 11 registers, you have a 32 KB register file, and your batch size is 1K pixels, you'll run at ~18% of peak performance.
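Putting the two rules of thumb above into numbers, using Bob's example figures (purely illustrative arithmetic, not real hardware data):

```python
# 1) Threads needed to hide latency = total per-thread latency in clocks.
alu_clocks, tex_clocks = 20, 100
print(alu_clocks + tex_clocks)                         # 120 threads for full speed

# 2) The register file caps how many pixels can actually be in flight.
REG_BYTES = 16                                         # one 128-bit register
def pixels_in_flight(rf_bytes, regs_per_pixel):
    return rf_bytes // (regs_per_pixel * REG_BYTES)

rf_bytes, needed = 32 * 1024, 1024                     # 32 KB file, ~1K pixels for full speed
print(pixels_in_flight(rf_bytes, 2))                   # 1024 -> 2 regs/pixel still runs at full speed
print(f"{min(1.0, pixels_in_flight(rf_bytes, 11) / needed):.0%}")   # ~18% of peak with 11 registers
```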

As for how to deal with tiny polygons with large batches, the answer is simple: Allow for more than one polygon per batch. With small batches, the benefits of more than 1 primitive/batch are more questionable and introduce a whole lot of extra complexity.
 
Demirug said:
The whole command token system that nVidia uses is not documented at all.
Obviously, but it's still surprising how much detail you guys know. I never even heard of these command tokens, and I'm lacking far more basic knowledge as well...
Believe me, with hundreds of threads you can hide the memory latency completely, as long as you have enough bandwidth. The high number of threads even makes it possible to build long pipelines that can compensate for additional cycles.
Same question I asked Jawed: Doesn't that require huge register files, 32 temporary registers per pixel per thread (for ps 3.0)? So an architecture with 16 pixels and 128 threads would require a 1 MB register file? That sounds like a cache to me, not a register file.

Then again, R580 is huge and abstract schematics (like this) show register files larger than the shader units. It's hard to imagine but is this reality?
I know you are in the software rendering business and maybe this makes you a little bit CPU-biased.
Yes I'm in the software rendering business but I still have great interest in GPUs. It's just that at university I've taken every possible CPU architecture related course, but they didn't have any about GPU architectures. :-| After my current project(s) I might do some GPGPU stuff nobody else is doing. :cool:
But even if they don’t execute branches in the same way as GPUs they only work at maximum speed if the branch predication can detect a pattern. If you take a look at IA64 you will see that it doesn’t like branching, too. The general rule there is to calculate both paths and finally write only the result from one.
Yes, I've seen statistics where IA64 is spending half the time waiting for data. I believe Hyper-Threading could do miracles for this architecture though.
A GPU will never be a CPU, but in the future there will be more than just rendering problems that they solve well.
Awesome. I'm still trying to get a good vision of where GPU technology is going versus CPU technology though. The parallelization revolution is over for GPUs, but for CPUs it's just getting started...
 
Yes the register file in ATI's ultra-threaded architectures is huge. Though there's no GPU that can run at full speed if every pixel has a full set of 32 FP32s! Full-speed seems to be limited to three FP32s per pixel, apparently. That's still a lot of memory.

But, hey, memory's cheap...

If each pixel needs more FP32s, then fewer batches will be scheduled. It's really a matter of how much space there is in the register file, as Bob showed.

Jawed
 
Nick said:
Obviously, but it's still surprising how much detail you guys know. I never even heard of these command tokens, and I'm lacking far more basic knowledge as well...

The patent office is your friend.

Nick said:
Same question I asked Jawed: Doesn't that require huge register files, 32 temporary registers per pixel per thread (for ps 3.0)? So an architecture with 16 pixels and 128 threads would require a 1 MB register file? That sounds like a cache to me, not a register file.

They don’t store the full 32 registers per thread in a memory block. The driver reorders the shader code to reduce the number of registers needed. Additionally there are some tricks to “store” values in the pipeline itself. If this is still not enough, the number of threads is reduced.

Nick said:
Then again, R580 is huge and abstract schematics (like this) show register files larger than the shader units. It's hard to imagine but is this reality?

Those are marketing diagrams. Don’t trust what you see there. But today we get much more information than we did a few years ago.

Nick said:
Awesome. I'm still trying to get a good vision of where GPU technology is going versus CPU technology though. The parallelization revolution is over for GPUs, but for CPUs it's just getting started...

CPUs are going in the direction of running more threads at the same time. But the program has to be split into threads to get more speed.

GPUs will do this too (WDDM 2.1), but primarily GPUs are built for massive SIMD operations. This means that they can add more and more ALUs/FPUs and still scale without code changes. But the multitasking support on GPUs will help us run multiple jobs (collision detection, neural networks for the AI, graphics) at the same time on one chip.
 
Bob said:
Let's do a little mental exercise. What if your ALU pipeline was 1000 cycles long? How would you feed it? If you treat it like an ordinary CPU pipeline, then dependent consecutive instructions will take a 1000-cycle hit because they need to wait for the result of the previous operation to be available before starting the next one.

...

So what's next? Well that's simple: Just fill up the ALU pipeline with 1000 different quads. Back-to-back instruction dependencies will not have any stalls. Total latency goes up but the throughput is maintained at 1 quad/clock.
Thanks, I think the concept is starting to get through to me!
That's the total thread latency. This is what you need to cover with that thread and other shader threads. If your shader takes 20 clocks through ALUs and 100 clocks through texturing, you need to keep its registers around for 120 clocks. Hence, to run at full speed, you need 120 threads to run serially. Notice that this rule is independent of the actual implementation: ALUs and texture units can be separate, or combined, or there could be multiple ALUs or whatever. What matters here is the total latency of the system.
So it's the time between one shader instruction and the next, for a given quad? One 'loop' for all quads in a batch? And for NVIDIA this is a long time (big batch) so it can include a texture sample operation, while for ATI it's a much smaller time (small batch) which requires textures to be sampled separately and thread execution to be halted/resumed in a Hyper-Threading fashion.

I suddenly realize why every NVIDIA chip has the same number of texture samplers as pixel pipelines. :idea:

Thanks a lot Bob!
 