ATI R500 patent for Xenon GPU?

psurge said:
the wording of the supposed leak suggests to me that a single thread is either a group of 64 pixels or a group of 64 vertices.

I dunno,

Leak said:
"The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load."

...it seems to specifically mention 64 simultaneous threads?

i.e. 64 threads each having 64 pixels or vertices to process, a total of 4096 vertices or pixels?
 
we need some more inside information... please!
A new leak, or something... that can be dissected before E3... starved for information here... :devilish:
 
EndR said:
we need some more inside information... please!
A new leak, or something... that can be dissected before E3... starved for information here... :devilish:

What's wrong with the 'old' leak! :devilish:
 
[Image: xbox2_scheme_bg.gif]


Okay, some finer tuning on my earlier speculation...

There would be 4 unified shader (US) units/cores in the R500,

US = 12 ALU + 4 TMU

If each thread can process 64 pixels or vertices, then the Arbiter scheduling to 12 ALUs doesn't seem to fit. Also the leak mentions "16 PS input interpolates per cycle" which suggests that those 48 ALUs aren't identical and there are 16 specialized ALUs. I'll call them sALU for want of a better word and these would be better paired with the TMU. So,

US = 8 ALU + (4 sALU + 4 TMU)

...so you have 8 processing 'units' either side of the Arbiter and 8 threads scheduled either side of the Arbiter and 16 threads per US unit.

This would suggest that 1 processing 'unit' (ALU, sALU or TMU) could work on a thread with 64 pixels or vertices.

And a total of 4 US units in the R500 would be linked via a bus, sharing their reservation stations of pixel and vertex threads fed by a shared cache and local eDRAM. Each US unit would work on 16 threads, i.e. 16-way SMT cores making the R500 capable of maintaining 64 threads.
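
To keep the arithmetic straight, here's a rough tally of that layout (only the 48-ALU and 64-thread figures come from the leak; the rest is my speculation):

Code:
# Rough tally of the speculated layout above. Only the 48-ALU and
# 64-thread figures come from the leak; everything else is my guess.
US_UNITS           = 4    # speculated unified shader cores in R500
ALUS_PER_US        = 12   # 8 "plain" ALU + 4 sALU in the split above
TMUS_PER_US        = 4
THREADS_PER_US     = 16   # i.e. each US core is 16-way SMT
OBJECTS_PER_THREAD = 64   # pixels or vertices per thread (from the leak)

total_alus        = US_UNITS * ALUS_PER_US      # 48, matches the leak
total_tmus        = US_UNITS * TMUS_PER_US      # 16
total_threads     = US_UNITS * THREADS_PER_US   # 64, matches the leak
objects_in_flight = total_threads * OBJECTS_PER_THREAD  # 4096 pixels/vertices

print(total_alus, total_tmus, total_threads, objects_in_flight)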

Just some random thoughts! :p
 
Good thinking!

Jaws said:
US = 8 ALU + (4 sALU + 4 TMU)

...so you have 8 processing 'units' either side of the Arbiter and 8 threads scheduled either side of the Arbiter and 16 threads per US unit.

May I suggest that your sALU is a texture address ALU?

I was thinking along the lines of:

US = 8 ALU (pixel/vertex shader configured as 4D+1D) + 4 ALU (texture address) + 4 TMU

The only question I still have is where do you put the tALU (rather than "sALU")? In the shader core with the other ALU(s) or directly connected to the TMU? Right now it seems the latter would be better.

But I'm just blundering around here, what do I know?

Jawed
 
Jawed said:
Good thinking!

Jaws said:
US = 8 ALU + (4 sALU + 4 TMU)

...so you have 8 processing 'units' either side of the Arbiter and 8 threads scheduled either side of the Arbiter and 16 threads per US unit.

May I suggest that your sALU is a texture address ALU?

I was thinking along the lines of:

US = 8 ALU (pixel/vertex shader configured as 4D+1D) + 4 ALU (texture address) + 4 TMU

The only question I still have is where do you put the tALU (rather than "sALU")? In the shader core with the other ALU(s) or directly connected to the TMU? Right now it seems the latter would be better.

But I'm just blundering around here, what do I know?

Jawed

Thanks!

Yeah, the texture address tALU would definitely fit the bill of the specialized sALU. I like the symmetry of the patent so I'd be inclined to put it alongside the TMU on the other side of the Arbiter. Thus maintaining 8 threads either side of the Arbiter and bringing balance to the universe! :p

And, yes, the 8 other ALUs would most likely be 4 vector ALUs and 4 scalar ALUs for general-purpose threads and vertex and pixel threads. I'll designate them vALU and sALU.

US = (4 vALU + 4 sALU) + (4 tALU + 4 TMU)

and

R500 = 4 US

I'm also thinking that the vertex and pixel Reservation Stations feeding the Arbiter of each US may be unified. They may be a single partitioned lump of local memory, like SRAM?
 
In logical terms there's no reason to partition the reservation stations into separate blocks of RAM:

- a pixel command thread consists of shader instructions
- a vertex command thread consists of shader instructions

But in addition to the code, you also need to keep the state of a thread each time it's switched out of context (e.g. when an ALU operation has to wait for a TMU operation).

To switch context you need to keep a copy of the registers and other variable stuff that's specific to the command thread, e.g. texturing results and loop indexes.

So you need a block of RAM for thread-context swapping.

I suppose it makes most sense to make the reservation stations a small bit of RAM that's nothing more than pointers into global RAM, where the command threads are found, accompanied by all the context.

In theory you could store these pointers in a block of memory that functions like registers: the pointers are all the same size, and the length of the list of command threads is fixed. So why not build a fixed block of on-chip RAM to do that? That way you've got practically zero latency for arbitration.
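
As a thought experiment, here's roughly what I mean by a pointer-based reservation station (a minimal sketch; all the sizes and field names are made up):

Code:
# A minimal sketch of the "reservation station = fixed list of pointers"
# idea above. The sizes and field names are hypothetical; the patent only
# motivates the general structure.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ThreadContext:
    pc: int                       # next instruction in the command thread
    registers: List[float]        # saved register state for this thread
    waiting_on_texture: bool = False

# Global on-chip RAM holding the full contexts (indexed by "pointer").
context_ram: List[ThreadContext] = []

# The reservation station itself is just a fixed-length list of indices
# into context_ram -- cheap to scan every cycle, so arbitration is fast.
STATION_SLOTS = 16
reservation_station: List[Optional[int]] = [None] * STATION_SLOTS

def arbitrate() -> Optional[int]:
    """Pick the first (oldest) thread that isn't stalled on a texture fetch."""
    for ptr in reservation_station:
        if ptr is not None and not context_ram[ptr].waiting_on_texture:
            return ptr
    return None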

Jawed
 
Jawed said:
...
In theory you could store these pointers in a block of memory that functions like registers: the pointers are all the same size, and the length of the list of command threads is fixed. So why not build a fixed block of on-chip RAM to do that? That way you've got practically zero latency for arbitration.

Jawed

That's what I meant, sorry my post wasn't clear. When I say local SRAM, I actually mean on-chip memory, not system RAM. Too much time with CELL, I guess, and its on-chip local SRAM memories!

I was thinking of a layout like below for the IC and its 4 US,

[Image: 27al.jpg]


4 US units, each with unified local SRAM (i.e. unified VS + PS Reservation Stations), and then a large pool of cache feeding them, with an on-chip memory controller keeping track of everything, including the eDRAM.

-------

Also, I need to refine the US unit again as I've realised some additional info has been omitted. At the last count,

US = (4 vALU + 4 sALU) + (4 tALU + 4 TMU)

I need to revise this because I dug up the full quote from the leak,

Leak said:
The Xenon GPU is a custom 500+ MHz graphics processor from ATI. The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load. The ALUs can each perform one vector and one scalar operation per clock cycle, for a total of 96 shader operations per clock cycle. Texture loads can be done in parallel to ALU operations. At peak performance, the GPU can issue 48 billion shader operations per second.

http://www.beyond3d.com/forum/viewtopic.php?t=13470

I forgot about this when reading the patent...

So the 48 ALUs are identical and can execute 96 shader ops per cycle via concurrent scalar and vector ops per ALU.
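
Quick sanity check on those numbers, taking the "500+ MHz" clock at face value as 500 MHz:

Code:
# Quick sanity check on the leak's numbers, taking the "500+ MHz" clock
# at face value as 500 MHz.
ALUS        = 48
OPS_PER_ALU = 2        # one vector + one scalar op per clock
CLOCK_HZ    = 500e6

ops_per_clock = ALUS * OPS_PER_ALU        # 96, as the leak says
ops_per_sec   = ops_per_clock * CLOCK_HZ  # 4.8e10 = 48 billion shader ops/s
print(ops_per_clock, ops_per_sec)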

The other assumption I made was that 64 threads = 64 processing units, when it's actually the minimum number of processing units. In this respect we can still keep the texture-addressing tALU alongside the TMU. So with each of the 48 ALUs capable of scalar and vector ops simultaneously,

US = 12 ALU + (4 tALU + 4 TMU)

R500 = 4 US

...with 12 shading ALU units on one side of the Arbiter and 8 processing units on the other.

Still something smells fishy! :p
 
OK, I've decided to read the bloody patent and see what else I can find.

First thing I find is this:

As such, there is a need for a sequencing system for providing for the processing of multi-command threads that supports an unlimited number of dependent texture fetches.

So it appears that a major goal is to perform dependent texture fetches over and over, in a loop, or simply repeatedly. i.e. a command thread to the graphics processing engine might consist of, say, three dependent texture fetches one after the other, before returning a result to the reservation station.
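
To illustrate what "dependent" means here: each fetch's address comes from the previous fetch's result, so the fetches can't be overlapped (Python stand-in for shader code; the texture contents are made up):

Code:
# Illustration of what "dependent" means: each fetch's address comes from
# the result of the previous fetch, so the fetches can't be overlapped.
# (Python stand-in for shader code; the texture contents are made up.)
texture = {0: 7, 7: 3, 3: 42}    # hypothetical texture: address -> texel

def dependent_fetch_chain(start_address: int, depth: int) -> int:
    value = start_address
    for _ in range(depth):        # depth = number of dependent fetches
        value = texture[value]    # next address depends on previous result
    return value

print(dependent_fetch_chain(0, 3))   # three dependent fetches -> 42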

Presumably dependent texture fetches are limited with current GPUs. If so, what would you do with unlimited dependent texture fetches?

Also, it seems to me that the ALU to calculate the address needs to support both vector and scalar calculations. Does that make sense?

Anyway, I'm guessing that there's a 4D+1D ALU in the TMU.

Jawed
 
Jaws said:
There would be 4 unified shader (US) units/cores in the R500,

US = 12 ALU + 4 TMU

If each thread can process 64 pixels or vertices, then the Arbiter scheduling to 12 ALUs doesn't seem to fit. Also the leak mentions "16 PS input interpolates per cycle" which suggests that those 48 ALUs aren't identical and there are 16 specialized ALUs. I'll call them sALU for want of a better word and these would be better paired with the TMU. So,

US = 8 ALU + (4 sALU + 4 TMU)

...so you have 8 processing 'units' either side of the Arbiter and 8 threads scheduled either side of the Arbiter and 16 threads per US unit.

This would suggest that 1 processing 'unit' (ALU, sALU or TMU) could work on a thread with 64 pixels or vertices.

And a total of 4 US units in the R500 would be linked via a bus, sharing their reservation stations of pixel and vertex threads fed by a shared cache and local eDRAM. Each US unit would work on 16 threads, i.e. 16-way SMT cores making the R500 capable of maintaining 64 threads.

Just some random thoughts! :p


Hmm, why 4? It seems to me that a more interesting design point would be 16 units, each with 3 ALU resources and 1 TMU resource. There is nothing that says you have to have 4 pipes per quad.

One pipe per quad, running in 4 passes, would probably be more efficient, especially for the type of workloads under discussion. Also, the scheduling would be easier with only 4 resources to schedule per unit vs 16. I know how to design a 4-resource scheduler; a 16-resource scheduler is a monster.

Aaron Spink
speaking for myself inc.
 
The patent talks about odd and even clocks and using 4 stage interleaving for each.

ALU arbitration proceeds in the same way as fetch arbitration. The ALU arbitration logic chooses one of the pending ALU clauses to be executed. The arbiter selects the command thread by looking at the reservation stations, herein vertex and pixel reservation stations, and picking the first command thread ready to execute. In one embodiment, there are two ALU arbiters, one for the even clocks and one for the odd clocks. For example, a sequence of two interleaved ALU clauses may resemble the following sequence, (E and O stands for Even and Odd sets of 4 clocks) Einst0 Oinst0 Einst1 Oinst1 Einst2 Oinst2 Einst0 Oinst3 Einst1 Oinst4 Einst2 Oinst0. As such, this way hides the latency of 8 clocks of the ALUs. Moreover, the interleaving also occurs across clause boundaries, as discussed in greater detail below.

So rather than interleaving four pixels of a quad in time, it appears that this architecture executes blocks of 4 ALU instructions for 2 objects (pixels or vertices). If a command thread's length isn't divisible by 4 the arbiter can chuck in another command thread when needed.

Well, that's how I'm reading it.
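
As a toy of how I read the interleaving (this only shows the alternation between the even and odd threads, not the real arbiter; instruction names are arbitrary):

Code:
# Toy of the even/odd interleaving as I read it: two command threads (one
# on even clocks, one on odd) alternate in the same ALU pipeline, hiding
# the 8-clock latency. This is only the alternation, not the real arbiter.
from itertools import zip_longest

even_clause = ["Einst0", "Einst1", "Einst2"]           # thread on even clocks
odd_clause  = ["Oinst0", "Oinst1", "Oinst2", "Oinst3"] # thread on odd clocks

def interleave(even, odd):
    issue_order = []
    for e, o in zip_longest(even, odd):
        if e is not None:
            issue_order.append(e)
        if o is not None:
            issue_order.append(o)
    return issue_order

print(interleave(even_clause, odd_clause))
# ['Einst0', 'Oinst0', 'Einst1', 'Oinst1', 'Einst2', 'Oinst2', 'Oinst3']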

I'm dubious about retaining a quad-pixel organisation simply because you're throwing ALU power down the drain (and vertices don't come in 4s). Dynamic branching is still fundamentally a horrible hog.

What's the point of going to all this effort to keep all the ALUs chomping code if you're gonna let dynamic branching screw it all up?

Jawed
 
Jawed said:
OK, I've decided to read the bloody patent and see what else I can find.

First thing I find is this:

As such, there is a need for a sequencing system for providing for the processing of multi-command threads that supports an unlimited number of dependent texture fetches.

So it appears that a major goal is to perform dependent texture fetches over and over, in a loop, or simply repeatedly. i.e. a command thread to the graphics processing engine might consist of, say, three dependent texture fetches one after the other, before returning a result to the reservation station.

Presumably dependent texture fetches are limited with current GPUs. If so, what would you do with unlimited dependent texture fetches?

AFAIK, dependent texture limits are removed for SM3.0. Removing the limit will allow you to write more advanced shaders. I came across a recent example for 'distance mapping', aka per-pixel displacement mapping with distance functions. It currently doesn't work on GPUs with that read limit,

http://www.beyond3d.com/forum/viewtopic.php?t=20571

Also, it seems to me that the ALU to calculate the address needs to support both vector and scalar calculations. Does that make sense?

Anyway, I'm guessing that there's a 4D+1D ALU in the TMU.

Jawed

Not sure... if the texture addressing/fetching/filtering algorithms have mixed scalar and vector elements, then it would help to have 5D.
 
aaronspink said:
Jaws said:
There would be 4 unified shader (US) units/cores in the R500,

US = 12 ALU + 4 TMU

If each thread can process 64 pixels or vertices, then the Arbiter scheduling to 12 ALUs doesn't seem to fit. Also the leak mentions "16 PS input interpolates per cycle" which suggests that those 48 ALUs aren't identical and there are 16 specialized ALUs. I'll call them sALU for want of a better word and these would be better paired with the TMU. So,

US = 8 ALU + (4 sALU + 4 TMU)

...so you have 8 processing 'units' either side of the Arbiter and 8 threads scheduled either side of the Arbiter and 16 threads per US unit.

This would suggest that 1 processing 'unit' (ALU, sALU or TMU) could work on a thread with 64 pixels or vertices.

And a total of 4 US units in the R500 would be linked via a bus, sharing their reservation stations of pixel and vertex threads fed by a shared cache and local eDRAM. Each US unit would work on 16 threads, i.e. 16-way SMT cores making the R500 capable of maintaining 64 threads.

Just some random thoughts! :p


Hmm, why 4? It seems to me that a more interesting design point would be 16 units, each with 3 ALU resources and 1 TMU resource. There is nothing that says you have to have 4 pipes per quad.

One pipe per quad, running in 4 passes, would probably be more efficient, especially for the type of workloads under discussion. Also, the scheduling would be easier with only 4 resources to schedule per unit vs 16. I know how to design a 4-resource scheduler; a 16-resource scheduler is a monster.

Aaron Spink
speaking for myself inc.

Well, my first impression was also to have 16 US units, as discussed earlier in the thread. It seems to make more sense to have 16 US units that are 4-way SMT each than 4 US units that are 16-way SMT each.

Option 1:

US = 3 ALU + (tALU + TMU)

R500 = 16 US

Option 2:

US = 12 ALU + 4(tALU + TMU)

R500 = 4 US


...however, the leaked diagram does mention quads, i.e. "2 2*2 pixel quads + Z/stencil" being output, which inclines me more towards option 2.

But thinking about this again, those pixel quads may only be applicable to the 'other' side of the Arbiter, i.e. the tALU + TMU...?
 
Jawed said:
The patent talks about odd and even clocks and using 4 stage interleaving for each.

ALU arbitration proceeds in the same way as fetch arbitration. The ALU arbitration logic chooses one of the pending ALU clauses to be executed. The arbiter selects the command thread by looking at the reservation stations, herein vertex and pixel reservation stations, and picking the first command thread ready to execute. In one embodiment, there are two ALU arbiters, one for the even clocks and one for the odd clocks. For example, a sequence of two interleaved ALU clauses may resemble the following sequence, (E and O stands for Even and Odd sets of 4 clocks) Einst0 Oinst0 Einst1 Oinst1 Einst2 Oinst2 Einst0 Oinst3 Einst1 Oinst4 Einst2 Oinst0. As such, this way hides the latency of 8 clocks of the ALUs. Moreover, the interleaving also occurs across clause boundaries, as discussed in greater detail below.

So rather than interleaving four pixels of a quad in time, it appears that this architecture executes blocks of 4 ALU instructions for 2 objects (pixels or vertices). If a command thread's length isn't divisible by 4 the arbiter can chuck in another command thread when needed.

Well, that's how I'm reading it.

I read it differently, as an example of one embodiment. Here they seem to be describing the syncing between TWO separate Arbiters...

I'm dubious about retaining a quad-pixel organisation simply because you're throwing ALU power down the drain (and vertices don't come in 4s). Dynamic branching is still fundamentally a horrible hog.

What's the point of going to all this effort to keep all the ALUs chomping code if you're gonna let dynamic branching screw it all up?

Jawed

See reply to AS above.

AFAIK, the SM3.0 spec requires more flexible branching. As GPUs become more general-purpose computing devices and more like CPUs, this is inevitable. There will always be a trade-off between general purpose and fixed function for a given transistor budget.
 
Jaws said:
Jawed said:
Also, it seems to me that the ALU to calculate the address needs to support both vector and scalar calculations. Does that make sense?

Anyway, I'm guessing that there's a 4D+1D ALU in the TMU.

Not sure... if the texture addressing/fetching/filtering algorithms have mixed scalar and vector elements, then it would help to have 5D.

I'm thinking of dependent texturing in the vertex shader. But that's a guess.

Jawed
 
Jaws said:
I read it differently, as an example of one embodiment. Here they seem to be describing the syncing between TWO separate Arbiters...

But the output of the two arbiters is shared by one pair of ALU/graphics processing engines.

AFAIK, the SM3.0 spec requires more flexible branching. As GPUs become more general-purpose computing devices and more like CPUs, this is inevitable. There will always be a trade-off between general purpose and fixed function for a given transistor budget.

I'm thinking that it's better to have one command thread stall upon a branch than to have all four command threads in a quad run the sum of the lengths of all code paths through the branch.

A branch stall in this architecture appears to last for 3 cycles. If a loop is unrolled into 4-instruction clauses then you'll only get a stall when the loop terminates. If both pixels (interleaved, going into the same ALU) run the loop the same or a different number of times, then you'll get 2x3 cycles of stall. It's a fixed wasted-cycle cost for the loop, no matter how much difference in loop execution length there is for the two pixels.

If a four-pixel quad runs the loop the same number of times for all pixels then you'll get a 7-cycle stall (time to fill the ALU pipeline after the code exits the loop, assuming the pipeline consumes 8 cycles to execute each instruction). But if any pixel runs the loop for longer, then you get a stall that lasts as long as the extra cycles for that loop multiplied by the number of pixels that have already completed the loop.
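
A back-of-the-envelope version of that comparison (the 3-cycle stall is my assumption, as is treating the quad penalty as extra loop cycles multiplied by pixels already finished):

Code:
# Back-of-the-envelope version of the comparison above. The 3-cycle
# branch stall is my assumption, as is treating the quad penalty as
# (extra loop cycles) x (pixels already finished).
BRANCH_STALL = 3   # assumed cycles lost per object when its loop exits

def pair_stall_cycles() -> int:
    # Two interleaved pixels each pay the fixed exit stall, however
    # different their loop counts are.
    return 2 * BRANCH_STALL

def quad_stall_cycles(extra_cycles: int, pixels_done: int) -> int:
    # Pixels that finished the loop early sit idle while the longest
    # pixel keeps going.
    return extra_cycles * pixels_done

print(pair_stall_cycles())        # 6: a fixed cost for the pair
print(quad_stall_cycles(16, 3))   # 48: grows with loop-length divergence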

Obviously the last thing you should do is execute extremely short loops very few times (or very short If, or Else clauses) because then you'll be thrashing the command thread arbiter and generating very low ALU utilisation.

Pixel quads are a win when all pixels in the quad run a dynamically branched loop the same number of times. But that's the only way a pixel quad and this kind of loop can work efficiently. Why are you even using a dynamically branched loop in that case?...

Jawed
 
Jaws said:
But thinking about this again, those pixel quads may only be applicable to the 'other' side of the Arbiter, i.e. the tALU + TMU...?

Seems to me that you can have logical pixel quads, where the graphics processing engine groups texture accesses into quads (or more?) of texture memory locality.

But I think the heavy level of command thread switching and varying clause-length ALU pipeline-packing means that a pixel quad gets serialised for submission to the ALU engine.

The quad becomes 2 pixels as an interleaved pair, followed by the other two pixels as an interleaved pair. The filtered texels that the quad is dependent upon will have been generated already by the graphics processing engine and safely stashed in EDRAM ready for the corresponding command threads (4) to get their next run in the ALU core.

With the arbiters using aging and a FIFO organisation for command thread scheduling, a "quad" of pixels entering the reservation station is going to exit the reservation station as a "quad".

Unless there's some dynamic branching in the code :)

Jawed
 
Jaws said:
...however, the leaked diagram does mention quads, i.e. "2 2*2 pixel quads + Z/stencil" being output, which inclines me more towards option 2.

But thinking about this again, those pixel quads may only be applicable to the 'other' side of the Arbiter, i.e. the tALU + TMU...?

It's not the other side of the arbiter. Think about the interface to the eDRAM. Think of it as the number of output pixels per cycle; also the 6600, etc. There is the number of pixels you can work on in a cycle and the number of pixels you can output in a cycle. They may or may not be directly linked.

Taking it back to CPUs, there are plenty of CPUs that can execute >4 and upwards of 11 instructions per cycle, but can only retire ("output") 3-4 instructions per cycle.

Aaron Spink
speaking for myself inc.
 
We can probably count on a 1:4 ratio of TMUs to shader ops in this logic. Seems fair for future games and will certainly decrease the burden of the assumed low RAM in the next-gen consoles.
 
What about disallowing control flow (besides predicated instructions) and texture ops inside an ALU clause? Basically an ALU clause would look like this:

Code:
{
    math_op1
    math_op2
    ...
    math_opN
    [texture_op] 
    [control_flow_op]
}

A thread (in this case per pixel or per vertex) is returned to the reservation station at the end of each clause. You have enough information available to determine:
- whether the thread is waiting for a texture
- which clause it should continue with, and whether or not this clause has been locally cached

Going with aaronspink's 3 ALUs and 1 TMU per arbiter/reservation station, couldn't the arbiter issue 3 ALU threads asking for the same ALU clause? That way you are still sharing decode/issue logic, but threads can issue out of quad order, meaning that you only pay a branch penalty when a particular branch is taken by fewer than 3 pixels/verts.
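
Roughly, the arbiter would group ready threads by the clause they want next and issue up to 3 that share a clause. A sketch (every name here is made up; the 3-ALU/1-TMU split is aaronspink's):

Code:
# Rough sketch of the issue idea: group ready threads by the clause they
# want next and issue up to 3 that share a clause. The 3-ALU/1-TMU split
# is aaronspink's; every name here is made up.
from collections import defaultdict
from typing import Dict, List, Tuple

ALU_SLOTS = 3   # ALU resources per arbiter/reservation station

def pick_issue_group(ready_threads: List[Tuple[int, int]]) -> List[int]:
    """ready_threads holds (thread_id, next_clause_id) pairs.
    Issue up to ALU_SLOTS threads that all want the same clause, so the
    decode/issue logic is still shared but threads need not stay in
    quad order."""
    by_clause: Dict[int, List[int]] = defaultdict(list)
    for thread_id, clause_id in ready_threads:
        by_clause[clause_id].append(thread_id)
    # Simple heuristic: pick the clause with the most waiting threads.
    best_clause = max(by_clause, key=lambda c: len(by_clause[c]))
    return by_clause[best_clause][:ALU_SLOTS]

print(pick_issue_group([(0, 5), (1, 5), (2, 7), (3, 5)]))   # -> [0, 1, 3]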

Does any of this sound reasonable?
 