ATI R500 patent for Xenon GPU?

j^aws

Veteran
Multi-thread graphic processing system

Abstract

The present invention includes a multi-thread graphics processing system and method thereof including a reservation station having a plurality of command threads stored therein. The system and method further includes an arbiter operably coupled to the reservation station such that the arbiter retrieves a first command thread of the plurality of command threads stored therein such that the arbiter receives the command thread and thereupon provides the command thread to a command processing engine. The system and method further includes the command processing engine coupled to receive the first command thread from the arbiter such that the command processor may perform at least one processing command from the command thread. Whereupon, a command processing engine provides the first command thread back to the associated reservation station.

and

...
14. A graphics processing system comprising: a pixel reservation station having a plurality of pixel command threads stored therein; a vector reservation station having a plurality of vector command threads stored therein; an arbiter coupled to the pixel reservation station and the vector reservation station; an arithmetic logic unit (ALU) operably coupled to the arbiter; and a texture engine operably coupled to the arbiter wherein the arbiter retrieves a first selected command thread from one of the plurality of pixel command threads and the plurality of vector command threads and the arbiter thereupon provides the first selected command thread to at least one of: the ALU and the texture engine.
...

and

...
BACKGROUND OF THE INVENTION

[0002] In a graphics processing system, it is important to manage and control multiple command threads relating to texture applications. In a typical graphics processing system, the processing elements, such as vertices and/or pixels, are processed through multiple steps providing for the application of textures and other processing instructions, such as done through one or more arithmetic logic units (ALU). To improve the operating efficiency of a graphics processing system, the control of the flow of the multiple command threads is preferred.
...

and

...
[0038] As such, the present invention allows for multi-thread command processing effectively using designated reservation station, in conjunction with the arbiter, for the improved processing of multiple command threads. The present invention further provides for the effective utilization of the ALU and the graphics processing engine, such as the texture engine, for performing operations for both pixel command threads and vertex command threads, thereby improving graphics rendering and improving command thread processing flexibility.
...

Multi-thread graphic processing system

Sounds like the load balancing control for vertex and pixel threads for the ATI R500, aka Xenon GPU...?
 
Maybe that's how they are implementing unified shaders... that arbiter will determine on the fly, based on prioritization of the multiple threads it's receiving, how many ALUs (out of 48) to dedicate to pixel shading versus vertex... makes the ratio of one to the other completely fluid and dynamic...
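
Just to make that concrete, here's a toy Python sketch of the kind of split I mean. Everything in it (the queue contents, the proportional split, even re-using the 48 figure) is my own guess at how an arbiter could balance the pool, not anything from the patent:

Code:
from collections import deque

# Toy arbiter: split a shared ALU pool between pixel and vertex work in
# proportion to how much of each is queued up. Purely illustrative numbers.

TOTAL_ALUS = 48  # the figure being thrown around for R500

def allocate_alus(pixel_queue, vertex_queue, total=TOTAL_ALUS):
    """Hand out ALUs in proportion to pending work in each queue."""
    pending = len(pixel_queue) + len(vertex_queue)
    if pending == 0:
        return 0, 0
    pixel_share = round(total * len(pixel_queue) / pending)
    return pixel_share, total - pixel_share

pixel_queue = deque(f"ps_{i}" for i in range(30))
vertex_queue = deque(f"vs_{i}" for i in range(10))
print(allocate_alus(pixel_queue, vertex_queue))  # (36, 12) for this workload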

Of course that is layperson understanding. :D
 
^^ Yep, sounds like it...

Patent said:
...
6. The command processing system of claim 1 further comprising: an arithmetic logic unit (ALU) operably coupled to the arbiter such that the arbiter is capable of providing at least one of the plurality of command threads to the ALU.
...

The ALU can receive either vertex or pixel threads from the arbiter...
 
Some thoughts (unordered):

1) The interleaving of threads on an ALU was an interesting twist I did not expect (rough sketch at the end of this post). Anything to boost the total number of instructions executed per second, I guess.

2) The introduction of multiple reservation stations makes things interesting. The obvious split is between pixel and vertex threads, but go further. Presume for a second that the X2 has eDRAM and this patent applies to the GPU in the X2. If one of the reservation stations is in memory, they might be able to do deferred tiled rendering. The point, of course, would be to only process vertices or pixels within the portion of the screen currently in the eDRAM. It's sort of a crazy idea, but the pieces fit together rather well.
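
On point 1, here is a rough Python sketch of what I picture by interleaving (my own toy model, not anything lifted from the patent): the ALU keeps issuing from whichever threads are ready, so a thread stalled on a texture read doesn't leave the unit idle.

Code:
from collections import deque

# Toy model of interleaving command threads on a single ALU. Thread names,
# instruction lists and the stall latency are all made up for illustration.

threads = {
    "vs_0": deque(["mul", "add", "dp4"]),
    "ps_0": deque(["mad", "mad"]),
    "ps_1": deque(["mad", "rsq"]),
}
stalled_until = {"ps_0": 3}  # pretend ps_0 waits on a texture read until cycle 3

cycle = 0
while any(threads.values()):
    issued = False
    for name, instrs in threads.items():
        if not instrs or cycle < stalled_until.get(name, 0):
            continue  # finished, or still waiting on its texture result
        print(f"cycle {cycle}: {name} issues {instrs.popleft()}")
        issued = True
        cycle += 1
    if not issued:
        cycle += 1  # no thread ready this cycle, the ALU idles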
 
Good find Jaws.

Inane_Dork, interesting comments. The R400 was a radical new design, and the R500 is a console follow-up part. I think most of us have assumed the radical new concept was unified shaders... but maybe there is more? Right now it seems memory is a huge limitation. Sure, procedural textures may help some, but we have a ton of high-quality features (high-res normal maps, HDRL, high levels of AA) that most of us expect next-gen hardware to do, yet the memory bandwidth is just not there.

Dave has hinted at some ideas about the R520 having some focus on memory issues. It would be really nice to get some real advancements into the new consoles, stuff that sets a high standard for years to come.

Again Jaws, good info there. Hopefully the patent turns into something beneficial.
 
Unified ALUs will be nice for saving silicon budget, but they seem to work optimally only when the arbiter is intelligent enough. I assume you could also use them directly in a manual configuration mode by assigning a fixed number of pixel/vertex shader units.
 
It would be nice to know how much bigger a unified vertex/pixel ALU is than a dedicated vertex or pixel ALU. The arbiter(s) will also be more complex than before, so that will eat some more transistors.
If this cost is cheap, say less than 5-10% of ALU die area, it could be a big win for ATI.
We should also note that a unified ALU could have some tradeoff between area and speed, so even if it doesn't get that much bigger it could be slower in some areas...
 
Maybe someone can correct me, but in theory doesn't dynamic load balancing with unified shaders help with frame rate stability? I would think that if they are implemented well, they would improve the minimum frame rate, because hiccups associated with PS and VS bottlenecks could be smoothed over. Being able to dynamically load balance at any time would, I believe, make the framerate much more stable in shader-limited games.

Another advantage is diversity--both in a single game and from game to game. Designers can decide to be pixel shader intensive or vertex shader intensive, or somewhere in between.

I really am excited to see what ATi crunched into the R500. It will be neat to see how it performs.
 
Acert93 said:
Maybe someone can correct me, but in theory doesn't dynamic load balancing with unified shaders help with frame rate stability?
It may help in some cases, but most of the time frame rate instability is caused by a lack of fill rate (as with tons of huge transparent particles) or a sudden increase in vertex or pixel shading demand (without a decrease in the counterpart...)
I bet edram is more effective in reducing frame rate hiccups than unified ALUs ;)
 
nAo said:
I bet edram is more effective in reducing frame rate hiccups than unified ALUs ;)

I will take both please... and while you are at it, can you throw in some high performance TBR that Inane_Dork mentioned? ;)
 
Acert93 said:
I will take both please... and while you are at it, can you throw in some high performance TBR that Inane_Dork mentioned? ;)
What Inane_Dork mentioned could be used to defer rendering, but the point is why would you do that?
If you're going after a TBDR you need to take very special care of all the information you'd need to save and re-use, so even if it could be done 'that' way it doesn't mean it would be efficient. How much bandwidth is needed to restore the thread state? Would it be feasible to do that with external memory? Dunno...
To save memory I'd prefer to split the viewport n times, render n viewports and do a final composite pass (quick sketch below).
Moreover, once you have a big pool of edram and have designed your GPU around it, you already have a lot of the advantages a TBDR has, like multisampling AA (almost) for free.
Features that a TBDR can provide, such as no overdraw and automatic sorting of non-opaque fragments, would be nice to have, but these things don't come for free once you have deferred the rendering phase ;)
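
For what it's worth, the viewport split I have in mind is just bookkeeping, something like this (a minimal sketch with made-up numbers, not tied to any real GPU):

Code:
# Split the screen into n horizontal strips, render each strip into the small
# on-chip buffer, then composite. Resolution and strip count are arbitrary.

def split_viewport(width, height, n):
    """Return n strips (x, y, w, h) that cover the full viewport."""
    strips, y = [], 0
    for i in range(n):
        strip_h = height // n + (1 if i < height % n else 0)
        strips.append((0, y, width, strip_h))
        y += strip_h
    return strips

for strip in split_viewport(1280, 720, 3):
    print("render pass over", strip)  # each pass only needs a strip-sized buffer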
 
nAo said:
If this cost is cheap, say less than 5-10% of ALU die area, it could be a big win for ATI.

If what I've been hearing is interpreted accurately, then unified shading architectures (at the hardware level) are scheduled to go all the way to mobile phones, fairly soon (in mobile development terms), so I'm guessing the control unit isn't that sizeable.
 
DaveBaumann said:
If what I've been hearing is interpreted accurately, then unified shading architectures (at the hardware level) are scheduled to go all the way to mobile phones, fairly soon (in mobile development terms), so I'm guessing the control unit isn't that sizeable.
I wasn't referring to an ALU die area increase due to the more complex control unit, but to the intrinsic inefficiency of a design that tries to solve two similar but different problems.
(obviously I'm talking about pixel and vertex shading here).
In a couple of years or so I expect nvidia to have GPUs with more ALUs than ATI GPUs for a given transistor budget, with the ATI part being able to sustain a bigger IPC than the NVIDIA part, which should sport more cores...
It will be an interesting battle ;)
As I already stated, if ATI can master both problems without trading away too much performance and die area, they will win this battle, imho.
 
nAo said:
As I already stated, if ATI can master both problems without trading away too much performance and die area, they will win this battle, imho.
On paper, and therefore theoretically speaking, Nvidia would have to screw something up in order to be beaten on both fronts. If Nvidia does everything correctly, and the same goes for ATI, each one should have its own forte.
 
nAo said:
I wasn't referring to an ALU die area increase due to the more complex control unit, but to the intrinsic inefficiency of a design that tries to solve two similar but different problems.

What do you think those ALU issues are?

In terms of ALUs, both PS and VS are required to operate on the same instructions (under a WGF2.0 or greater environment at least), which suggests that the only ALU differences between the two would be what you implement as native instructions and what you implement as macros - what instructions would benefit more from one type of processing than another? Does it also mean that every ALU is necessarily exactly the same?

David Kirk highlighted texturing demands as one issue; however, looking at this it seems that could be negated (as was indicated in the reply we sourced from ATI in answer to Kirk's comments), as there are separate texture and ALU command queues - this also seems like a fairly neat way of avoiding texture latencies, as ALU operations (be they vertex or pixel) are executed on the ALU pool whilst other commands are waiting for texture reads (the results of which can then just go back into the ALU command queue).
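
To illustrate how I'm reading that (just my own simplification, not anything sourced from ATI): a thread that hits a texture instruction gets parked in the texture queue while the ALU pool keeps draining the ALU queue, and a completed texture read drops the thread back into the ALU queue.

Code:
from collections import deque
import random

# Toy model of separate ALU and texture command queues. Thread counts, op
# counts and latencies are invented purely for illustration.

random.seed(0)
alu_queue = deque((f"thread_{i}", 3) for i in range(6))  # (name, ALU ops left)
texture_queue = deque()

cycle = 0
while alu_queue or texture_queue:
    # texture engine: an outstanding read completes every few cycles
    if texture_queue and cycle % 4 == 0:
        name, ops_left = texture_queue.popleft()
        alu_queue.append((name, ops_left))           # result ready, back it goes
        print(f"cycle {cycle}: texture read done for {name}")

    # ALU pool: execute one op from the head of the ALU queue
    if alu_queue:
        name, ops_left = alu_queue.popleft()
        ops_left -= 1
        print(f"cycle {cycle}: ALU op for {name} ({ops_left} left)")
        if ops_left > 0 and random.random() < 0.5:
            texture_queue.append((name, ops_left))   # next op needs a texture read
        elif ops_left > 0:
            alu_queue.append((name, ops_left))
    cycle += 1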
 
In terms of ALUs, both PS and VS are required to operate on the same instructions (under a WGF2.0 or greater environment at least), which suggests that the only ALU differences between the two would be what you implement as native instructions and what you implement as macros - what instructions would benefit more from one type of processing than another?
Operating on the same instructions doesn't mean having the same behaviour, or even the same implementation (see x86 ISA history...)
I once wrote in this forum that it seems nvidia wants to have two basic designs: one to cope with high-latency operations and another one to cope with low-latency operations.
Since I'm not a hardware guy I'm really guessing here... but I expect these two designs to be quite different if one tries to push the envelope and starts to make assumptions (this is what I do as a software guy...)

Does it also mean that every ALU is necessarily exactly the same?
That's not the case, of course.

David Kirk highlighted texturing demands as one issue; however, looking at this it seems that could be negated (as was indicated in the reply we sourced from ATI in answer to Kirk's comments), as there are separate texture and ALU command queues - this also seems like a fairly neat way of avoiding texture latencies, as ALU operations (be they vertex or pixel) are executed on the ALU pool whilst other commands are waiting for texture reads (the results of which can then just go back into the ALU command queue).
Well... even an uneducated guy like me proposed something like that 2 or 3 years ago in a previous iteration of this forum, so it's nothing groundbreaking.
The problem is... it sounds like a cool and relatively simple thing to do, and if you look at the nv40 pixel pipes, nvidia is already doing something like that, since one ALU out of two is also used for texture sampling, even if that design is not unified.
Nvidia even has a patent that is quite similar to the one posted in this thread, but they have different functional units for vertices and pixels, fed by a 'central' thread manager.
At the end of the day I don't think future designs by NVIDIA and ATI will be so different at the functional level.
Nvidia could even switch to a fully unified design very quickly once the shading model is unified too, assuming they extend the pixel pipe ALUs to vertex processing (and primitive assembly too!).
I say this with reference to this patent:
System and method for reserving and managing memory spaces in a memory resource
System and method for reserving a memory space for multithreaded processing is described. Memory space within a memory resource is allocated responsive to thread type. Examples of thread types for graphics processing include primitive, vertex and pixel types. Memory space allocated may be of a predetermined size for a thread type. Memory locations within a first memory space may be interleaved with memory locations within a second memory space.
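
A minimal sketch of what I read into that abstract (the slot sizes, thread counts and the flat back-to-back layout are all my assumptions, not the patent's actual scheme):

Code:
# Reserve a fixed-size memory slot per thread, with the slot size chosen by
# thread type. Sizes, counts and addresses below are hypothetical.

SLOT_BYTES = {"primitive": 64, "vertex": 128, "pixel": 256}
THREADS_PER_TYPE = 16

def slot_address(thread_type, thread_index, base=0x0000):
    """Start address of the slot reserved for a given thread."""
    offset = base
    for t in ("primitive", "vertex", "pixel"):   # regions laid out back to back
        if t == thread_type:
            return offset + thread_index * SLOT_BYTES[t]
        offset += THREADS_PER_TYPE * SLOT_BYTES[t]
    raise ValueError(f"unknown thread type: {thread_type}")

print(hex(slot_address("vertex", 3)))  # 0x400 + 3*128 = 0x580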

ciao,
Marco
 
I'd take a 10% worst-case slowdown over the slowdown that something like texture sampling in the vertex shader is going to cause in an architecture optimized for vertex shaders using only low-latency streamed data.
 
It would mean less performance than you can achieve in a pixel shader, not just poor performance outright ;) (as it is in the current nv40 design)
 
Operating on the same instructions doesn't mean having the same behaviour, or even the same implementation (see x86 ISA history...)
I once wrote in this forum that it seems nvidia wants to have two basic designs: one to cope with high-latency operations and another one to cope with low-latency operations.

It strikes me that the high-latency instructions are those dealing with textures – these are what Kirk has singled out as well. This is addressed in this scenario.

The problem is... it sounds like a cool and relatively simple thing to do, and if you look at the nv40 pixel pipes, nvidia is already doing something like that, since one ALU out of two is also used for texture sampling, even if that design is not unified.

Is the texture ALU able to operate during the texture latency? (i.e. still interleave instructions whilst waiting to address the texture) It strikes me that having an independent texture address processor, as in the R300 model, is more similar.

Personally I’m beginning to think that NVIDIA’s reluctance to go to unified shaders is more driven by looking at a future involving multi-chip implementations than anything else.
 
DaveBaumann said:
It strikes me that the high-latency instructions are those dealing with textures – these are what Kirk has singled out as well. This is addressed in this scenario.
Yeah... I was speaking about texture sampling; it was so clear to me that I forgot to mention it.
Is the texture ALU able to operate during the texture latency? (i.e. still interleave instructions whilst waiting to address the texture)
AFAIK yes.
It strikes me that having an independent texture address processor, as in the R300 model, is more similar.
Who cares if it's more or less similar if they're functionally doing the same thing and achieving the same results?
I don't want to blame nvidia if their architecture is not too similar to what another IHV proposes and designs ;) (if they achieve the same results, and at this time ATI and NVIDIA seem to be quite on par on many fronts, imho)
Personally I’m beginning to think that NVIDIA’s reluctance to go to unified shaders is more driven by looking at a future involving multi-chip implementations than anything else.
Oh well... do you mean something like nvidia having one IC dedicated to geometry and another one dedicated to pixels?!
Where did you get this idea about multi-chip implementations?
 