ATI R500 patent for Xenon GPU?

one said:
Unified ALUs will be nice for saving silicon budget, but they seem to work optimally only when the arbiter is intelligent enough. I assume you could also use them in a manual configuration mode by assigning a fixed number of pixel/vertex shader units.

OK, explain to me why it needs to be smart.

You can't do too much pixel processing because you need to have transformed the verts first, and you can't do too much vertex processing because you need somewhere to store the results.

So all the arbiter has to do is keep the ALUs in use, parcelling them out on a demand basis. If the vertex FIFO is full, then they all go to rendering; if it's empty, they transform.

The reason this is a relatively simple problem (assuming the intervening FIFO is large enough) is that both tasks are linked.
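
A quick sketch of that all-or-nothing policy in Python (the FIFO depth and the tie-break between full and empty are my own assumptions, not anything from the patent):

```python
# Minimal sketch of the all-or-nothing arbiter described above.
# The FIFO depth and the middle-ground tie-break are assumptions.
FIFO_CAPACITY = 64  # depth of the vertex->pixel FIFO (invented)

def assign_all_alus(vertex_fifo_occupancy):
    """Decide what every ALU works on this cycle."""
    if vertex_fifo_occupancy >= FIFO_CAPACITY:
        return "pixel"   # FIFO full: all ALUs go to rendering
    if vertex_fifo_occupancy == 0:
        return "vertex"  # FIFO empty: all ALUs transform verts
    return "pixel"       # in between: drain the FIFO (arbitrary choice)
```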
 
That's poor use of bandwidth; you want to dynamically adjust the balance, not go full on/off. Still, it shouldn't be hard.
 
Actually I believe that it's likely to swap 100% either way on a regular basis. Doing a Z pass requires no pixel shader work and rendering large polygons almost no VS work.

I'm sure the algorithm is somewhat more complex than the one I described, but my point is it's not possible to have one thread group outstrip the other since they are in a pipeline.

I just don't see why people think engineers would want control over the ALU distribution. It would largely defeat the purpose.
 
ERP said:
Actually I believe that it's likely to swap 100% either way on a regular basis. Doing a Z pass requires no pixel shader work and rendering large polygons almost no VS work.

I'm sure the algorithm is somewhat more complex than the one I described, but my point is it's not possible to have one thread group outstrip the other since they are in a pipeline.

I just don't see why people think engineers would want control over the ALU distribution. It would largely defeat the purpose.

What about non-graphics threads, e.g. general purpose vector threads like physics? Wouldn't you want control then or would they just be another vertex thread?
 
Wouldn't make any difference - both vertex and pixel threads under this structure have the same capabilities, so whether you define it as a pixel or vertex thread appears to be neither here nor there. Which command queue it ends up in will be down to how the programmer defines it, but it's arbitrary if each stack has the same capabilities.
 
nAo said:
Oh well..do you mean what if nvidia has an IC dedicated to geometry and another one dedicated to pixels?!
Where did you get this idea about multichip implementations?

Just an impression I'm getting from the noises that are being made. Whether sooner or later I don't know.
 
DaveBaumann said:
nAo said:
Oh well..do you mean what if nvidia has an IC dedicated to geometry and another one dedicated to pixels?!
Where did you get this idea about multichip implementations?

Just an impression I'm getting from the noises that are being made. Whether sooner or later I don't know.

Stop it...you're going to summon 3dfx / Sage / Rampage talk... ;)
 
Jaws said:
ERP said:
Actually I believe that it's likely to swap 100% either way on a regular basis. Doing a Z pass requires no pixel shader work and rendering large polygons almost no VS work.

I'm sure the algorithm is somewhat more complex than the one I described, but my point is it's not possible to have one thread group outstrip the other since they are in a pipeline.

I just don't see why people think engineers would want control over the ALU distribution. It would largely defeat the purpose.

What about non-graphics threads, e.g. general purpose vector threads like physics? Wouldn't you want control then or would they just be another vertex thread?


Makes no difference.

OK, think of it another way.

I have X amount of ALU ops I must perform per frame; some are vertex, some are pixel and some are generic (say physics). As long as whenever I have work to do I do it somewhere, it doesn't really matter how I allocate an ALU in a given instant: I'm going to do the same amount of work no matter what...

In the real world things are more complex: you have queues of "threads" and just parcel out the ALUs based on some heuristic to whichever queue is best.

Now this does assume that ALU ops are the bottleneck. In the case of large simple polygons it's possible you could be fill constrained; the vertex output FIFO would fill, there would be no work for the pixels, and ALUs would be idle in this case.
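
To put some rough numbers on the "X amount of ALU ops" point (the clock and frame rate here are assumptions for illustration, not figures from the patent or the leak):

```python
# Back-of-the-envelope per-frame ALU budget. The 500 MHz clock and
# 60 fps target are assumed figures, purely for illustration.
ALUS = 48
CLOCK_HZ = 500_000_000
FPS = 60

ops_per_frame = ALUS * CLOCK_HZ // FPS
print(f"{ops_per_frame:,} ALU issue slots per frame")  # 400,000,000

# However those slots are split between vertex, pixel and generic work
# from one instant to the next, the frame total is fixed -- so the
# arbiter's only real job is keeping slots from going idle.
```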
 
nAo said:
What Inane_Dork mentioned could be used to defer rendering, but the point is why would you do that?
If you're going after a TBDR you need to take very special care over all the information you'd need to save and re-use, so even if it could be done 'that' way it doesn't mean it would be efficient. How much bandwidth is needed to restore the thread state? Would it be feasible to do that with external memory? Dunno..
To save memory I'd prefer to split the viewport n times, render n viewports and do a final composite pass.
Moreover, once you have a big pool of eDRAM and you have designed your GPU around it, you already have a lot of the advantages a TBDR has, like multisampling AA (almost) for free.
Features that a TBDR can provide, such as no overdraw and automatic sorting of non-opaque fragments, would be nice to have, but these things don't come for free once you have deferred the rendering phase ;)
Hey, I did say it was a crazy idea. :p

For the purpose of fitting into cache, I would guess multiple viewports to be better if you do it correctly. The only advantage I can think of with this is that the "fitting into cache" algorithm becomes a hardware solution.


I just realized that this thing might be more useful with parallelization hints from the programmer. Drawing several opaque cubes, for instance, could all be in these reservations stations simultaneously without a problem. Drawing particle effects, however, should be done back to front. It would be really cool if the priority of the pixel threads in such a case were based on the distance from the camera. That way, the developer supplies n particles and the hardware depth sorts for him.
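
For contrast, here's a sketch of the CPU-side sort this would be replacing (the particle data and layout are invented):

```python
import math

# The back-to-front sort that correct alpha blending needs today, which
# the distance-keyed thread priority above would push into hardware.
def depth_sort(particles, camera_pos):
    """Order particles far-to-near so blending composites correctly."""
    return sorted(particles,
                  key=lambda p: math.dist(p, camera_pos),
                  reverse=True)

particles = [(0, 0, 5), (0, 0, 2), (0, 0, 9)]
print(depth_sort(particles, camera_pos=(0, 0, 0)))
# -> [(0, 0, 9), (0, 0, 5), (0, 0, 2)]
```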

Heh. Crazy idea #2. :p
 
Joe DeFuria said:
Stop it...you're going to summon 3dfx / Sage / Rampage talk... ;)

Well, given that the guy who designed large parts of NV40's shader core, and presumably led the design, was also the guy who headed up Rampage development at 3dfx, those types of influences in the company must be fairly large. [I need an "eyebrow raise" smiley ;)]
 
To make a transition like that (from a single IC to multiple ICs) that ex-3dfx guy would need pretty convincing arguments..
Anyway, it's nice to know I'm not the only one who has heard something along the lines of nvidia going dual chip.
I was told it was about G70 or G80, even if I was (and am..) really tempted to tag it as an unfounded rumour.
 
ERP said:
I have X amount of ALU ops I must perform per frame; some are vertex, some are pixel and some are generic (say physics). As long as whenever I have work to do I do it somewhere, it doesn't really matter how I allocate an ALU in a given instant: I'm going to do the same amount of work no matter what...

Aren't you forgetting about latency and limits to how much stuff can be buffered? For example, let's say you run the shaders until all shader threads are blocked on memory I/O, then you switch to processing vertex shader requests. If there are no vertices buffered up and ready to go to the ALUs, then there will be a wait, especially if it has to read the vertex buffers from video memory, which is already under contention from hundreds of outstanding texture reads. That means there could be a delay if the "vertex station" is empty.

The only way this scheme works without introducing lost cycles/pipeline bubbles is if the stations feeding the arbiter can never run dry, but if you run an "all on/all off" arbitration strategy, it seems to me that one of the two *can* be empty, in which case there may not be any work ready for the arbiter to hand out *in the next cycle*.


On the other hand, I must say I'm disappointed that this kind of stuff can get patented. I've read design books that have similar block diagrams in them for pooling functional units. The patent should be more specific to the function of the arbiter, IMHO.
 
DemoCoder said:
ERP said:
I have X amount of ALU ops I must perform per frame; some are vertex, some are pixel and some are generic (say physics). As long as whenever I have work to do I do it somewhere, it doesn't really matter how I allocate an ALU in a given instant: I'm going to do the same amount of work no matter what...

Aren't you forgetting about latency and limits to how much stuff can be buffered? For example, let's say you run the shaders until all shader threads are blocked on memory I/O, then you switch to processing vertex shader requests. If there are no vertices buffered up and ready to go to the ALUs, then there will be a wait, especially if it has to read the vertex buffers from video memory, which is already under contention from hundreds of outstanding texture reads. That means there could be a delay if the "vertex station" is empty.

The only way this scheme works without introducing lost cycles/pipeline bubbles is if the stations feeding the arbiter can never run dry, but if you run an "all on/all off" arbitration strategy, it seems to me that one of the two *can* be empty, in which case there may not be any work ready for the arbiter to hand out *in the next cycle*.


On the other hand, I must say I'm disappointed that this kind of stuff can get patented. I've read design books that have similar block diagrams in them for pooling functional units. The patent should be more specific to the function of the arbiter, IMHO.

OK, for the last time: I'm not promoting all on/all off. Clearly you'd want something more complex than that, and yes, I was grossly simplifying the problem. Clearly you have to have queues of work always available and somewhere to put the results.

I just don't think you need anything really complicated (from an algorithmic standpoint) to make this work, and if it does work the last thing you'd want is some way for an engineer to override it.

In a real solution you'd probably want to favor pixel work over vertex work, and have a high/low watermark on the intermediate FIFO that would transition work units over on some sliding scale. But even this might be overkill if the latency is predictable.
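
A sketch of that watermark idea (the thresholds and the linear ramp are my own guesses, not anything from the patent):

```python
# Hypothetical high/low watermark arbitration on the vertex->pixel FIFO.
# Below LOW, shift ALUs toward vertex work; above HIGH, toward pixel work;
# slide linearly in between. Every constant here is invented.
TOTAL_ALUS = 48
LOW, HIGH = 16, 48  # watermarks on an assumed 64-entry FIFO

def split_alus(fifo_occupancy):
    """Return (vertex_alus, pixel_alus) for this scheduling interval."""
    if fifo_occupancy <= LOW:
        pixel_fraction = 0.25  # FIFO running dry: mostly vertex work
    elif fifo_occupancy >= HIGH:
        pixel_fraction = 1.0   # FIFO healthy: favor pixel work
    else:                      # slide between the watermarks
        t = (fifo_occupancy - LOW) / (HIGH - LOW)
        pixel_fraction = 0.25 + t * 0.75
    pixel = round(TOTAL_ALUS * pixel_fraction)
    return TOTAL_ALUS - pixel, pixel
```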
 
http://www.beyond3d.com/index.php#news21716

The patents made B3D frontpage headline news! :p



Some further speculation on the unified shader unit [US].

The First and Second reservation stations above are 'local memories'. It would make sense to have a cache hierarchy ahead of the arbiter, like a 'shared L1 cache', which would feed these 'local memories'. This L1 cache would also have two-way communication with the Xenon CPU cores' shared L2 cache:

[CPU: L2 Cache]<=>[L1 Cache: Arbiter US]

This would fit nicely with the supposed leaked specs, no?! :p
 
[Image: xbox2_scheme_bg.gif]


Looking at the alleged leaked diagram above and the GPU, item (5):

- the 16 bilinear texture fetches per cycle imply => 16 texture units (TMUs)

...and from Dave's link above re the leaked spec,

"The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load."

Therefore, some assumptions on the R500:

1. The R500 has 48 ALUs and 16 TMUs

2. The R500 is 64-way SMT

3. Each unified shader (US) unit has a 3:1 ALU:TMU ratio

4. The US arbiter can schedule 3 ALU threads and 1 TMU thread

5. The 3 ALU threads are composed of vertex, pixel or general vector threads.

Just some thoughts! :p

EDIT:

This would suggest 16 unified shader units for the R500.
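
Spelling out the arithmetic behind that (only the 48-ALU / 16-TMU / 64-thread figures come from the leaked spec; the uniform grouping is the speculation):

```python
# Only the 48/16/64 counts are from the leak; the rest is the guess.
alus, tmus, threads = 48, 16, 64

us_units = alus // 3                    # 16 units at 3 ALUs each (assumed)
threads_per_unit = threads // us_units  # 4 threads per unit
assert us_units == tmus                 # one TMU per unit falls out of 3:1
print(us_units, threads_per_unit)       # -> 16 4
```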
 
I don't think you should group TMUs with ALUs, since it doesn't seem there's any kind of coupling,
at least from the leaked specs. We don't even know if some ALUs are used in conjunction with some other unit to form a TMU,
or if there are 48 independent ALUs and 16 independent TMUs.
Moreover, I wouldn't call R500 a 64-way SMT architecture, because I doubt thread granularity is so fine.
NV40 batches quads in groups of about 1000 pixels; even if R500 is probably much more advanced, I don't think 1 thread = 1 pixel or 1 vertex.
If you want to hide (texture fetch) latency with so many pixels and vertices in flight, you'd need to process much more than 64 pixels at the same time (not at the same clock).
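
To put rough numbers on that last point (the fetch latency and shader length below are assumptions, not known figures):

```python
# How many pixels must be in flight to hide a texture fetch? The latency
# and the ALU-ops-between-fetches figures are assumed for illustration.
ALUS = 48
FETCH_LATENCY = 200  # cycles to satisfy a texture fetch (assumed)
OPS_PER_FETCH = 4    # ALU ops each pixel issues between fetches (assumed)

# Each pixel keeps an ALU busy for OPS_PER_FETCH cycles, then stalls for
# FETCH_LATENCY cycles. Keeping every ALU fed through the stall needs:
in_flight = ALUS * (1 + FETCH_LATENCY / OPS_PER_FETCH)
print(int(in_flight))  # 2448 -- far more than 64
```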
 
[Image: xbox2patent_01.gif]


This is the Xenon patent.

If the R500 has 16 US units, each with 3 ALUs and 1 TMU, what are the possible internal bus arrangements?

The US units can communicate internally with the usual topologies:

1. Parallel bus

2. Star bus

3. Ring bus

They'd need to maximise bandwidth and minimise latency, so a ring bus could be used, like CELL with its 8 SPEs. Otherwise a star or parallel bus with the 16 US units clustered closely, i.e. 2*8 or 4*4 clusters of US units with a shared cache feeding them and local access to eDRAM.
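
One crude way to compare those topologies is the average hop count between the 16 units (idealised; this ignores link width, arbitration and contention entirely):

```python
# Average hop distance between 16 US units for each topology, as a very
# rough latency proxy. Real buses differ in width, clocks and contention.
N = 16

ring_avg = sum(min(d, N - d) for d in range(1, N)) / (N - 1)
star_avg = 2  # unit -> central hub -> unit
bus_avg = 1   # shared parallel bus: one arbitrated transfer

print(f"ring ~{ring_avg:.1f} hops, star {star_avg}, bus {bus_avg}")
# The ring trades longer average latency for short point-to-point links
# that can clock high -- the trade CELL makes with its 8 SPEs.
```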
 
nAo said:
I don't think you should group TMUs with ALUs, since it doesn't seem there's any kind of coupling,
at least from the leaked specs. We don't even know if some ALUs are used in conjunction with some other unit to form a TMU,
or if there are 48 independent ALUs and 16 independent TMUs.
Moreover, I wouldn't call R500 a 64-way SMT architecture, because I doubt thread granularity is so fine.
NV40 batches quads in groups of about 1000 pixels; even if R500 is probably much more advanced, I don't think 1 thread = 1 pixel or 1 vertex.
If you want to hide (texture fetch) latency with so many pixels and vertices in flight, you'd need to process much more than 64 pixels at the same time (not at the same clock).

Okay, I wasn't trying to suggest any tight coupling between the 3 ALUs and the 1 TMU per US unit. That's where the arbiter decouples them, as in the patent.

Granted, there are other combinations and permutations of arbiter and ALU/TMU arrangements. It just made sense to have a US unit that could operate on 4 threads (a 3:1 ratio of ALU:TMU). They could just as well be bigger US units with more ALUs and TMUs, but in the same ratio.

The '64-way SMT' label can sound misleading. I suppose it would be more accurate to say that, assuming the above, each US unit is 4-way SMT, and there would be 16 of them to scale to 64 threads in total.
 
Will quad pixel processing be a part of this architecture? If so, how many TMUs per quad?

Jawed
 