Xenos and R520/R580 - Design Differences

The overarching design goal that permeates both architectures, Xenos and R520/R580 alike, is that of dynamically maximizing available resources according to processing load. Both architectures have been prepared to realize what I call "dynamic efficiency" in subtly different ways.

Since many of you are familiar with Xenos' and R520/R580's architectural configurations, I ask: how do the design protocols of each allow for "dynamic efficiency" and how do their approaches differ?

Correct me if I'm wrong, but fundamentally, R580 and R520 offer a set of 8 MIMD vertex processors and 4 SIMD pixel processors operating independently, in MIMD fashion, fed by a load-balancing thread dispatch unit (scheduler). They offer a large register space to hold the values of the many instructions that are in flight. I'm not sure whether the same scheduler feeds both the vertex and pixel pipes or whether there is a scheduler dedicated to each MIMD node; if it is the latter, R580 offers 4 scheduling/dispatch units for the pixel processors and 8 for the vertex processors (if each vertex unit operates as an independent unit). Within each SIMD pixel processor, R520/580 has 4 texture samplers and address processors available to it, although I'm not sure if they can operate independently of the ALU processors within the SIMD quad, with instructions issued to them independently in their own thread (although it would only be 1 thread for all four). In addition, R580 offers a ring bus and a programmable and dynamically adaptable memory controller with 32-bit granularity. The batch size allocated to each pixel thread in flight is relatively small, which I know affects the architecture's ability to make efficient use of its units and handle dynamic branching.

How does Xenos' processing configuration differ from the above? What are the ramifications of Xenos' approach as opposed to R520/R580's?
 
Read the Xenos article.

A few quick notes: the VS in R5xx are not MIMD; they are still SIMD. Xenos doesn't feature the ring bus; it's not big enough to really need it, was the reason given. Xenos's texture units will serve any of the 3 shader arrays.
 
Within each SIMD pixel processor, R520/580 has 4 texture samplers and address processors available to it, although I'm not sure if they can operate independently of the ALU processors within the SIMD quad, with instructions issued to them independently in their own thread (although it would only be 1 thread for all four).
Can somebody verify whether the texture units operate independently of the ALUs (in their own thread) within a pixel quad unit?
 
Hijackin' this thread.

Over in the console talk forum (probably wrong place, but I dislike creating redundant threads) I asked two questions regarding Xenos' capabilities versus R520/RV530, notably in dynamic branching (/flow control):

http://www.beyond3d.com/forum/showthread.php?p=627543#post627543
TurnDragoZeroV2G said:
On the subject of Xenos' dynamic branching, I have a question there. I've lurked around long enough that I've read a fair majority (or maybe not, heh) of threads on Xenos, but I don't know if this has ever been addressed (simple yes or no would send me packing into search-wonderland... assuming this isn't a very large/huge misunderstanding on my part instead): From Dave's article:

Because the shader arrays are operating on threads larger than a quad, a grouper and scan converter are needed here. These two units batch up blocks of vertices or triangles that each have the same state (i.e. they will have the same properties, hence shader programs, attached to them) in order to maximise the batch. Where we often consider traditional pixel pipelines to be operating on pixel quads in individual triangles in a pipeline, this is not the case with Xenos - the processors will be operating over 4 2x2 quads of pixels over multiple triangles of the same state so that small triangles don't destroy the efficiency as they are batched together. Of course, there will be some processor element wastage at the edges of triangle batches (although the texture sampling efficiency increases in these cases).

Xenos is supposed to work on batches of 8x8 pixels. But it seems like it could actually batch up 2x2 pixel quads from any triangles in the scene, potentially, into the same batch. The R520 article's picture illustrating dynamic branching implies that all the pixels in the batch are adjacent to each other and that the smaller batch size is what allows it to achieve greater efficiency. But if Xenos is batching up pixel quads from multiple triangles, wouldn't this be the equivalent of making some batches out of a few 2x2 blocks in the shadow, some in the "grey," and then some in the full light? In which case, DB efficiency would essentially go down the toilet? Is my understanding of this batching totally off, or is there a lot more logic that goes into properly batching things together (or is it that all the pixels in the 64-pixel batch are adjacent to one another, even if they lie on multiple triangles... :???: )

And
TurnDragoZeroV2G said:
I have no reason to doubt it has dynamic branching (though perhaps there's the question of whether it has a branch execution unit like R520?). My question is more about: does the way Xenos works make it even less efficient than its batch size might indicate?
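
To make the batching worry in the two quoted questions concrete, here is a minimal Python sketch of a toy model. Everything in it is an assumption for illustration: the diagonal "shadow edge" mask, the function names, and above all the random scattering of quads, which is a deliberate worst case and not a description of how Xenos' grouper actually picks quads. It only counts how many batches contain pixels on both sides of a branch, since such a batch has to run both paths.

```python
import random

def shadow_mask(size):
    # Toy branch condition (assumption): a pixel takes the "lit" path
    # when it lies above a diagonal shadow edge, the "shadow" path otherwise.
    return [[x + y < size for x in range(size)] for y in range(size)]

def contiguous_batches(mask, tile):
    # Batches made of adjacent pixels: one batch per tile x tile screen block.
    size = len(mask)
    return [[mask[y][x]
             for y in range(ty, ty + tile)
             for x in range(tx, tx + tile)]
            for ty in range(0, size, tile)
            for tx in range(0, size, tile)]

def scattered_batches(mask, quads_per_batch, seed=0):
    # Worst-case assumption: 2x2 quads are packed into batches with no
    # regard for where they sit on screen (random order).
    size = len(mask)
    quads = [[mask[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)]
             for y in range(0, size, 2)
             for x in range(0, size, 2)]
    random.Random(seed).shuffle(quads)
    return [[pixel for quad in quads[i:i + quads_per_batch] for pixel in quad]
            for i in range(0, len(quads), quads_per_batch)]

def divergent_fraction(batches):
    # A batch is divergent when its pixels do not all take the same branch.
    divergent = sum(1 for b in batches if any(b) and not all(b))
    return divergent / len(batches)

mask = shadow_mask(64)
small = divergent_fraction(contiguous_batches(mask, 4))  # 16 adjacent pixels
large = divergent_fraction(contiguous_batches(mask, 8))  # 64 adjacent pixels
mixed = divergent_fraction(scattered_batches(mask, 16))  # 64 pixels, random quads
print(small, large, mixed)
```

In this toy scene the 4x4 batches diverge in 6.25% of cases and the adjacent 8x8 batches in 12.5%, while the randomly scattered 64-pixel batches diverge almost everywhere. Whether real hardware behaves closer to the adjacent case or the scattered case is exactly the open question above.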

Here (some of) Xenos' capabilities are thrown on the table

Under the dynamic flow control depth, Xenos has listed "4 for loops/calls, 2^23 if nesting." What differentiates the two, and how exactly does this compare to the flat-out "24" of SM3.0? Is this a case of Xenos falling short of or exceeding SM3.0 spec?

Any explanation, links to sites/posts explaining, or guarantee the answer is somewhere in search wonderland (or worse, in the article :p, and in either case feel free to smack me) would be cool.
 
I would also enjoy reading an answer to those questions, since I didn't understand some of those very aspects covered in the article.
 
Luminescent said:
Can somebody verify whether the texture units operate independently of the ALUs (in their own thread) within a pixel quad unit?

ALUs and texture operations are parallel, so independent. Well, as long as there is no dependency :)

Also, since there's heavy pipelining, talking about a "texture unit" is really talking about 100's of threads of work, being worked on, or waiting for their return data.
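
As a rough illustration of why so many threads are kept in flight, here is a minimal Python sketch of a toy scheduler. It assumes a single ALU, a fully pipelined texture path, and threads that alternate a fixed burst of ALU work with one texture fetch. The function name and the cycle counts (4 ALU cycles, 100 cycles of texture latency) are made-up numbers for illustration, not R5xx or Xenos figures.

```python
def alu_utilization(num_threads, alu_cycles, tex_latency, total_cycles=10_000):
    # ready_at[t] is the cycle at which thread t has its texture data back
    # and can be given ALU work again.
    ready_at = [0] * num_threads
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if runnable:
            t = runnable[0]                       # any ready thread will do
            busy += alu_cycles                    # ALU works on this thread
            cycle += alu_cycles
            ready_at[t] = cycle + tex_latency     # thread now waits on its fetch
        else:
            cycle = min(ready_at)                 # ALU idles until data returns
    return busy / cycle

print(alu_utilization(1, 4, 100))    # one thread: ALU idle most of the time
print(alu_utilization(26, 4, 100))   # enough threads to cover the latency
```

With these assumed numbers a single thread keeps the ALU busy under 5% of the time, while 26 threads, that is (4 + 100) / 4, are enough to keep it fully occupied in this model. Longer latencies or shorter ALU bursts push the required thread count up, which is consistent with the "100's of threads" remark above.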
 