The Official NVIDIA G80 Architecture Thread

I wonder why there are no dynamic branching (DB) figures for the X1900 XT

It's a conspiracy! It turns out ATI DB performance was so bad, they paid off Mike to suppress the results, using the broken GL driver excuse!

(for the humour impaired, it's a joke!)
 
Thinking more about it... since it's now possible to hide texture latency via arithmetic ops, having a 16KB register file per cluster should not be as bad as I thought.
Each fp32 register consumes 16 bytes.

If each fragment in flight is assigned 2x fp32s in the register file, that's 32 bytes.

There's 32 fragments per batch. That's 1KB.

So, 16KB would allow for only 16 batches in flight, per cluster.
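For anyone who wants to poke at the numbers, here's the same budget as a quick Python sketch; the 16-byte registers, 2 registers per fragment and 32-fragment batches are simply the assumptions from this post:

# Register-file budget per cluster, using the assumptions above.
BYTES_PER_REGISTER = 16        # one vec4 of fp32s
REGISTERS_PER_FRAGMENT = 2
FRAGMENTS_PER_BATCH = 32
REGISTER_FILE_BYTES = 16 * 1024

bytes_per_batch = BYTES_PER_REGISTER * REGISTERS_PER_FRAGMENT * FRAGMENTS_PER_BATCH
batches_in_flight = REGISTER_FILE_BYTES // bytes_per_batch
print(bytes_per_batch, batches_in_flight)   # 1024 bytes per batch, 16 batches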

Jawed
 
Each fp32 register consumes 16 bytes.
Why? It's a scalar architecture. IMHO they simply fetch 4 bytes x 16 per operand per clock per cluster from the register file.
If each fragment in flight is assigned 2x fp32s in the register file, that's 32 bytes.

There's 32 fragments per batch. That's 1KB.

So, 16KB would allow for only 16 batches in flight, per cluster.

Jawed
No, you're still thinking in terms of a non-scalar architecture :)
2 fp32s would simply take 8 bytes -> 2048 pixels in flight.
In my previous post I was taking into consideration a shader using 4 vec4 regs -> 256 pixels in flight -> 16 clock cycles to execute a single scalar op on all of them -> 7 mem cycles that we can hide per scalar op.
One would need 16 cycles' worth of math per pixel to hide a bit more than 100 mem cycles (and given the clock-domain-weighted ratio between the number of TAs and the number of SPs, it would not make sense anyway to issue texture fetches more often than about once every 10 ALU cycles).
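Rough Python version of that accounting; the 16KB file and 16 SPs per cluster are as above, and the ~575MHz figure for the slow domain is an assumption used only for the ALU-to-memory clock ratio:

# Scalar register accounting per cluster (clock figures are assumptions).
REGISTER_FILE_BYTES = 16 * 1024
SPS_PER_CLUSTER = 16
ALU_CLOCK_MHZ = 1350.0
CORE_CLOCK_MHZ = 575.0   # assumed slow-domain clock, only used for the ratio

def pixels_in_flight(fp32_regs_per_pixel):
    return REGISTER_FILE_BYTES // (4 * fp32_regs_per_pixel)

print(pixels_in_flight(2))                        # 2 scalar regs -> 2048 pixels
pixels = pixels_in_flight(16)                     # 4 vec4 regs = 16 scalars -> 256 pixels
alu_clocks_per_op = pixels // SPS_PER_CLUSTER     # 16 ALU clocks per scalar op
hidden_mem_cycles = alu_clocks_per_op * CORE_CLOCK_MHZ / ALU_CLOCK_MHZ
print(pixels, alu_clocks_per_op, round(hidden_mem_cycles, 1))  # 256, 16, ~6.8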
 
You need to actually hold the contents of registers in memory somewhere.

What happens when the shader does a sequence of scalar ops across all fragments in flight? 256 fragments in flight (your number, not mine) will take 16 clocks for a scalar instruction, at 1350MHz. Does G80 hide the latency then?

Jawed
 
You need to actually hold the contents of registers in memory somewhere.
that is what the register file is for.
What happens when the shader does a sequence of scalar ops across all fragments in flight? 256 fragments in flight (your number, not mine) will take 16 clocks for a scalar instruction, at 1350MHz. Does G80 hide the latency then?
Of course it doesn't, the SPs would stall. But my premise was that this shader uses 4 vec4 registers, so it MUST do many more ops to use all those input and output operands; that's why your scenario is highly unlikely ;)
A shader that just performs a single scalar op on some data fetched from a texture might use just a couple of temporary registers, and in that case you would have 16 * 8 = 128 ALU cycles to hide the fetch behind; still not enough, but that's for sure a much better scenario :)
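Same arithmetic for that two-register case, under the same 16KB / 16 SP assumptions as above:

# Latency cover for a shader using only 2 scalar fp32 temporaries.
pixels = (16 * 1024) // (4 * 2)        # 2048 pixels fit in the register file
alu_cycles_of_cover = pixels // 16     # 16 SPs per cluster -> 128 ALU cycles per scalar op
print(pixels, alu_cycles_of_cover)     # 2048, 128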
 
It seems each SP can compute a scalar MADD per cycle, so it needs 3 FP32s as input and 1 FP32 as output.
If they store the same register duplicated 16 x N (N=2?) times in contiguous locations in the register file, they might need to split it into something like four 4KB banks, each bank having a 512-bit data path.
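Where those bank numbers come from, as a quick sketch (the four-bank split and the N=2 duplication are pure speculation from this post):

# Bank/data-path arithmetic for the speculated register-file layout.
SPS_PER_CLUSTER = 16
FP32_BYTES = 4
BANKS = 4
REGISTER_FILE_BYTES = 16 * 1024

bits_per_bank_fetch = SPS_PER_CLUSTER * FP32_BYTES * 8   # one operand for 16 SPs = 512 bits
bytes_per_bank = REGISTER_FILE_BYTES // BANKS            # 4 KB per bank
# Three such fetches per clock feed a MADD (a, b, c); a fourth path takes the write-back.
print(bits_per_bank_fetch, bytes_per_bank)               # 512, 4096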
Wrote enough BS tonight, it's time to go to bed :)
 
Of course it doesn't, the SPs would stall. But my premise was that this shader uses 4 vec4 registers, so it MUST do many more ops to use all those input and output operands; that's why your scenario is highly unlikely ;)
OK, well you've decided to draw an arbitrary line at a 16KB register file, which is clearly incapable of hiding the latency in lots of situations.

I just picked the most obvious exception.

Jawed
 
OK, well you've decided to draw an arbitrary line at a 16KB register file, which is clearly incapable of hiding the latency in lots of situations.
actually you failed to show such a situation since your hypothesis can't be valid if you take my premise :)

I just picked the most obvious exception.

Jawed
and it's wrong, good night :)
 
It seems each SP can compute a scalar MADD per cycle, so it needs 3 FP32s as input and 1 FP32 as output.
What about a co-issued attribute interpolation? Or scalar special-function?

If they store the same register duplicated 16 x N (N=2?) times in contiguous locations in the register file, they might need to split it into something like four 4KB banks, each bank having a 512-bit data path.
Wrote enough BS tonight, it's time to go to bed :)
Yeah, bedtime for me too.

Jawed
 
Because branching doesn't work like we want under GLSL with ATI... They try to unroll loops and predicate branches in all public drivers. We have DX versions of the test, which used to behave, but there was a subtle change in the DX spec that is breaking the test for *both* vendors, so we need to figure out how to fix that...

This is not a problem I'm aware of. Do you have an example of when this would happen and you don't expect unrolled loops / predication?
 
It seems each SP can compute a scalar MADD per cycle, so it needs 3 FP32s as input and 1 FP32 as output.
If they store the same register duplicated 16 x N (N=2?) times in contiguous locations in the register file, they might need to split it into something like four 4KB banks, each bank having a 512-bit data path.
Wrote enough BS tonight, it's time to go to bed :)

Why would you need to dupe? You don't mean actual duplication of the contents of the register?
 
I think he means "duplicated" in the sense that there is another entry for each fragment in the batch. The "space-taken" or "cost-of-storage" is duplicated, not the contents :)

I would have expected that the register file is split into 'n' banks, where 'n' == the number of stream units in the cluster. I would not expect per-fragment registers to be required across fragments, and I would expect a fragment to remain bound to a streaming unit, so that keeps the data width fairly manageable. There may be exceptions (SFU and TXU access may differ slightly, and I seem to recall mention of certain DirectX commands requiring access across fragments), but I would think you would optimize for ALU access, handle SFU or TXU bank access as required, and take some kind of hit for the other exceptional cases. :shrug:

Edit: so the way the dispatcher would work would be that it issues the same load instructions with the same address to all SIMD/streaming processors -- each streaming unit goes to its own dedicated RAM, pulls its local register values, performs its op, and stores appropriately. That seems simpler to me than some kind of super-wide bus with a flat address space, with the dispatcher pulling all the data and then parceling it out to each ALU, although the results are much the same...
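A toy Python model of that dispatch scheme (the lane and register-slot counts are arbitrary); the point is just that one decoded instruction drives all lanes, and each lane only ever touches its own bank:

# Toy model: one instruction broadcast to all lanes, each lane reading its private bank.
LANES = 16            # streaming processors in a cluster
SLOTS = 8             # arbitrary register slots per lane for this toy

banks = [[0.0] * SLOTS for _ in range(LANES)]   # one dedicated RAM per lane

def simd_madd(dst, a, b, c):
    # The dispatcher issues the same instruction/register indices to every lane;
    # operands come only from that lane's own bank, so no cross-lane wiring is needed.
    for lane in range(LANES):
        regs = banks[lane]
        regs[dst] = regs[a] + regs[b] * regs[c]

simd_madd(dst=3, a=0, b=1, c=2)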
 
So there seem to be 4 types of storage that people are talking about, in possibly 2 different clock domains, and I'm wondering how they are related. In the SP (fast) clock domain are the per-thread registers. In the sampler (main, slow) clock domain is the per-cluster texture L1 cache. I'm guessing, based on the "As Fast Registers" bullet, that CUDA's "Parallel Data Cache" is in the SP domain. Finally, there's most likely a constant buffer cache somewhere.

I don't think anyone has tried to measure it experimentally, but we've theoretically calculated that the register file needs to be on the order of 8-32KB. The B3D review calculates the texture L1 cache as 8KB. The CUDA docs explicitly state the PDC as 16KB. We obviously know nothing about any CB cache since DX10 drivers don't exist yet.

So now the question is how they are all related. It seems sensible that the register file and texture cache are unique, since they are in different clock domains. The PDC seems odd to me. In the code example it has "size_t SharedMemBytes = 64; // 64 bytes of shared memory", which I assume is requesting 64 bytes of shared PDC storage. If you can request such a small size, it must mean that the PDC is shared with something. If the PDC is in the SP clock domain, then the obvious thing to share it with is the register file. I don't know enough about CBs to speculate as to where caches for them would be located.

So my guess is that there is an 8KB texture cache in the main clock domain and a 16KB register file in the SP domain that can be accessed either as thread-local or as batch-local.

The problem is that the 16KB register file would have to have 16x3 read ports (16 SPs * 3-operand MADD / cycle) plus 16 write ports, which is practically impossible to design from a circuit perspective. This might be feasible if there were 16 banks, each 3-read, 1-write. However, this depends on how the PDC shared memory works. In the worst case you'd need to get 3 shared-memory operands from 3 different banks to all 16 SPs.
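A rough tally of the two options, assuming a MADD (3 reads and 1 write per SP per cycle) is the widest single-cycle case:

# Port counts: one flat register file vs. one bank per SP.
SPS = 16
READS_PER_MADD = 3
WRITES_PER_MADD = 1

flat_read_ports = SPS * READS_PER_MADD               # 48 read ports on a single array
flat_write_ports = SPS * WRITES_PER_MADD             # 16 write ports
per_bank_ports = (READS_PER_MADD, WRITES_PER_MADD)   # 3R/1W per bank if split into 16 banks
print(flat_read_ports, flat_write_ports, per_bank_ports)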
 
What about a co-issued attribute interpolation? Or scalar special-function?
I was analyzing what I consider to be some kind of minimum register file bw arrangement.
When you have other ops being dual-scheduled you may well need more register bandwidth, so the hardware might be able to transfer more data from the register file than what I proposed, or it might only allow the 'coupling' of instructions that share some arguments. Probably the truth is somewhere in the middle.
 
Why would you need to dupe? You don't mean actual duplication of the contents of the register?
Poor choice of words, my bad. I meant that enough space must be allocated in the register file to save the contents of register X for the M pixels in flight (with M probably a multiple of 16).
 
The problem is that the 16KB register file would have to have 16x3 read ports (16 SPs * 3-operand MADD / cycle) plus 16 write ports, which is practically impossible to design from a circuit perspective. This might be feasible if there were 16 banks, each 3-read, 1-write. However, this depends on how the PDC shared memory works. In the worst case you'd need to get 3 shared-memory operands from 3 different banks to all 16 SPs.
What we can reasonably expect is to have the SPs within a cluster all working on the same instruction at any given clock cycle, so that you don't need a register file with a lot of ports, just one with a very wide data bus, as long as you are smart about how register space is allocated in the register file.
 
Here's three documents:

Simulating Multiported Memories Using Lower Port Count Memories
I haven't studied this in any detail.

System and Method for Reserving and Managing Memory Spaces in a Memory Resource
I've barely looked at this.

Shader Pixel Storage in a Graphics Memory
This is prolly the killer: being able to write data to framebuffer memory and then read it again, within the same shader. If you assume that there's a cache between you and framebuffer memory, then that cache becomes the Parallel Data Cache. By far the best candidate for this is the L2 memory that's associated with each set of ROPs, which are also associated with a memory controller.

Jawed
 