Anandtech - Inside the XBox 360 article

Another thing that has come to my attention is this:

http://www.beyond3d.com/reviews/ati/r520/index.php?p=03

The ALU breakdown in an ATI GPU pipe.

Note the first ALU has reduced functionality.

X360 gets 48 of the big ALU.

This could make it more powerful than its ALU count suggests. Each Xenos ALU might be, say, 66% of a traditional ATI pipe instead of 50%.

In the ALU 1 and ALU 2 breakdowns on the linked page, X360 gets 48 ALU 2s.

The implications are obvious. I've always figured two ALUs per pipe, so X360 is 24 pipes, but they have to be used for both pixel and vertex work, whereas Nvidia has 24 pixel plus 8 vertex pipes.

But if they're say, 66% of a pipe, it's more like Xenos is 32 pipes..
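
To make that arithmetic concrete, here is a quick sketch. The 50% and 66% weights are only the guesses from this post, not measured figures, and `equivalent_pipes` is a made-up helper name used for illustration.

```python
# Rough pipe-equivalence arithmetic. The per-ALU weights are guesses
# from the post above, not measured or published figures.
XENOS_ALUS = 48


def equivalent_pipes(alu_count: int, alu_weight: float) -> float:
    """Return how many 'traditional' pipes a set of ALUs would be worth
    if each ALU counts as alu_weight of one pipe."""
    return alu_count * alu_weight


if __name__ == "__main__":
    for weight in (0.5, 0.66):
        pipes = equivalent_pipes(XENOS_ALUS, weight)
        print(f"weight {weight:.2f}: about {pipes:.1f} pipes")
```

At a weight of 0.5 this gives 24 pipes, and at 0.66 it gives a little under 32, which is where the "more like 32 pipes" figure comes from.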

The only thing I can't reconcile is how they kept the transistor count so low if that's the case, but I'm working on that.
 
The simple reason why he says that R500's pipes are 0.5x R420 pipes is that the number of ALUs is different. If I remember the R500 design properly (which I might not, it's been a while since I read B3D's article), that seems like a VERY rough guess that considers only the number of ALUs and not the actual kind of ALU or anything else.
Furthermore, comparing the R500 to the R420 and not the R520 is most original, because from that POV it's the same damn thing. Also, no comparison with R520/R420 would properly show just how nice the dynamic branching & vertex texturing are on the Xenos.

As for Vec3/Vec4: there definitely are Vec4 operations in Pixel Shaders, just not quite as many as in the Vertex Shader, where you very often work on x, y, z and w at the same time. Also, the Vertex Shader tends to have slightly more scalar (x, y, z, OR w) operations, relatively speaking, but the PS can also have a bunch; that depends completely on what kind of thing it does at any given time.
I'm indeed not quite convinced by Vec4+Scalar for PS, but it can make things slightly faster than Vec3+Scalar. Also, Vec2+Vec2 is a horrible worst-case scenario for Vec4+Scalar, but that kind of operation is extraordinarily rare anyway, at least for the time being.


Uttar
EDIT: But remember about R520 vs R500 clocks on the same process: 625MHz vs 500MHz.
 
Bill said:
Another thing that has come to my attention is this:

http://www.beyond3d.com/reviews/ati/r520/index.php?p=03

The ALU breakdown in an ATI GPU pipe.

Note the first ALU has reduced functionality.

X360 gets 48 of the big ALU.

This could make it more powerful than its ALU count suggests. Each Xenos ALU might be, say, 66% of a traditional ATI pipe instead of 50%.

In the ALU 1 and ALU 2 breakdowns on the linked page, X360 gets 48 ALU 2s.

The implications are obvious. I've always figured two ALUs per pipe, so X360 is 24 pipes, but they have to be used for both pixel and vertex work, whereas Nvidia has 24 pixel plus 8 vertex pipes.

But if they're say, 66% of a pipe, it's more like Xenos is 32 pipes..

The only thing I can't reconcile is how they kept the transistor count so low if that's the case, but I'm working on that.

Yep... by my observations the ATI Radeon R4xx and R5xx series had two shader units PER pixel pipeline, but only ONE of those shader units was fully featured while the other one was a cut-down version (half unit?). It is not known exactly what the differences between the two are, beyond the fact that one is not fully featured. I would surmise (and I am not totally sure) that the cut-down shader unit is used for normalization, much like how nVidia's shader units have those various "MiniALUs" for similar purposes. Looking at the claimed shader operations per second for the two video cards (X850 XT and the GeForce 6800 Ultra as a comparison) seems to confirm this: the GeForce 6800 Ultra is claimed to perform 51.2 billion shader operations per second (at 400MHz) versus the ATI Radeon X850 XT's claimed 43 billion shader operations per second (at 540MHz). So yes... only the second shader unit in the Radeon R42x/R52x really counts towards the shader performance in the Radeon cards. The GeForce cards (NV4x/NV5x) have two fully featured shader units, though the FIRST shader unit is coupled to a texture unit, and if a texture operation is done you lose that first shader unit, whereas the shader units in the Radeon cards are not coupled to the texture units... so it kinda evens out between the two video cards.
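
For what it's worth, those claimed figures can be turned into per-clock numbers. This is only a sketch: the ops/second values are the marketing claims quoted above, and the 16-pipeline count for both cards is an assumption added here for illustration (it is not stated in the post).

```python
# Claimed shader-op figures from the post above. The "pipes" value of 16
# is an assumption added for illustration, not something the post states.
CARDS = {
    "GeForce 6800 Ultra": {"ops_per_sec": 51.2e9, "clock_hz": 400e6, "pipes": 16},
    "Radeon X850 XT": {"ops_per_sec": 43.0e9, "clock_hz": 540e6, "pipes": 16},
}


def ops_per_clock(ops_per_sec: float, clock_hz: float) -> float:
    """Claimed shader operations issued per clock cycle, whole chip."""
    return ops_per_sec / clock_hz


def ops_per_pipe_per_clock(card: dict) -> float:
    """Claimed shader operations per clock cycle, per pixel pipeline."""
    return ops_per_clock(card["ops_per_sec"], card["clock_hz"]) / card["pipes"]


if __name__ == "__main__":
    for name, card in CARDS.items():
        print(f"{name}: {ops_per_pipe_per_clock(card):.2f} claimed ops/pipe/clock")
```

Under those assumptions the GeForce works out to 8 claimed ops per pipe per clock and the Radeon to just under 5. Each vendor counts "shader operations" its own way, so this is a rough comparison at best.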

As for XENOS... each of the pipelines has fully featured shader units, and they are designed to handle 5 components (Vec4+scalar) instead of the 4 components (Vec3+scalar) that the Radeon cards handled in the pixel shader units. According to other articles those shader units also possess additional logic to handle the normalization.

Snipped from Dave's article...
Additional to the 48 ALU's is specific logic that performs all the pixel shader interpolation calculations which ATI suggests equates to about an extra 33% of pixels shader computational capability.

So yes... I tend to believe that XENOS is more akin to a 32-36 pixel pipeline Radeon R520 (as that is the only other ATI GPU that relies on increased ALU utilization from heavy multithreading) rather than an R420, as the shader units in XENOS and the X1000 series GPUs achieve a greater utilization rate than the R420/NV4x/NV5x shader units.

It is REALLY hard to say for sure without a more absolute comparison... in which case we will have to wait until ATI's and nVidia's next video cards at the end of NEXT year (R6xx/NV6x) before we can really do some comparisons.
 
Well, they did move 20-25million transistors of logic to the daughter die (and there's only half as many ROPs, which go for 4 samples but ditch programmable pattern for fixed 4-sample pattern, too, right?). And doesn't Xenos kinda lack a need for anything resembling Avivo, which would allow for some savings there?

On the subject of Xenos' dynamic branching, I have a question there. I've lurked around long enough that I've read a fair majority (or maybe not, heh) of threads on Xenos, but I don't know if this has ever been addressed (simple yes or no would send me packing into search-wonderland... assuming this isn't a very large/huge misunderstanding on my part instead): From Dave's article:

Because the shader arrays are operating on threads larger than a quad, a grouper and scan converter are needed here. These two units batch up blocks of vertices or triangles that each have the same state (i.e. they will have the same properties, hence shader programs, attached to them) in order to maximise the batch. Where we often consider traditional pixel pipelines to be operating on pixel quads in individual triangles in a pipeline, this is not the case with Xenos - the processors will be operating over 4 2x2 quads of pixels over multiple triangles of the same state so that small triangles don't destroy the efficiency as they are batched together. Of course, there will be some processor element wastage at the edges of triangle batches (although the texture sampling efficiency increases in these cases).

Xenos is supposed to work on batches of 8x8 pixels. But it seems like it could actually batch up 2x2 pixels from any triangles in the scene, potentially, into the same batch. The R520 article's picture illustrating dynamic branching implies that all the pixels in the batch are adjacent to each other and that the smaller batch size is what allows it to achieve greater efficiency. But if Xenos is batching up pixel quads from multiple triangles, wouldn't this be the equivalent of making some batches out of a few 2x2 blocks in the shadow, some in the "grey," and then some in the full light? In which case, DB efficiency would essentially go down the toilet? Is my understanding of this batching totally off, or is there a lot more logic that goes into properly batching things together (or is it that all the pixels in the 64-pixel batch are adjacent to one another, even if they lie on multiple triangles... :???: )
 
I just don't see how they kept the transistor count on Xenos so light.

It obviously has more ALUs than current ATI designs.

From R520's 321m, yeah, you can knock maybe 30 off for controller and extra ROPs..

And yeah maybe 20 for Avivo. No idea really.

One thing is, R520 has a HUGE memory controller area. I don't see how Xenos can multi-thread as well with far fewer transistors, though.
 
Bill said:
I just don't see how they kept the transistor count on Xenos so light.

It obviously has more ALUs than current ATI designs.

From R520's 321m, yeah, you can knock maybe 30 off for controller and extra ROPs..

And yeah maybe 20 for Avivo. No idea really.

One thing is, R520 has a HUGE memory controller area. I don't see how Xenos can multi-thread as well with far fewer transistors, though.

For reference... the R520 pixel pipelines are identical to the pixel pipelines from the R420 in that each pixel pipeline has one fully featured shader unit and one "Half Unit"... except that the texture units were removed from the pixel pipelines. I do not know if they improved the shader units like nVidia did with the NV5x, i.e. allowing the shader units to perform 5-component operations (Vec4+scalar) instead of the 4-component operations (Vec3+scalar) that the R420 pixel shader units handled.

Because XENOS does not have that massive memory controller (Ring Bus) that the R520 has (which, if I remember hearing from ATI correctly, is actually taking up the most space on the R520 currently)... the actual threading processor does not take that many transistors. I am still not clear on exactly how many threads each array can process at once, as that number is under NDA. On the Radeon X850, each pixel quad (4 pixel pipelines) consumed roughly 20-25 million transistors; on the R520 it is equal to or less than that, as the texture units were removed from the pixel pipelines; and on the NV4x/NV5x it is roughly 30-35 million transistors per pixel quad (slightly more on the NV5x due to increased logic). The R520 is unusual in the sense that the shader core is actually in the minority in terms of transistor usage, but considering ATI will be reusing this controller for the next several GPU incarnations I can see why... this memory controller was designed for the future.

On the subject of dynamic branching on XENOS... indications are that it is present, but I don't have any solid information on that.
 
I have no reason to doubt it has dynamic branching (though perhaps there's the question of whether it has a branch execution unit like R520?). My question is more about this: does the way Xenos works make it even less efficient than its batch size might indicate?

But, while we're offtopic:

Here (some of) Xenos' capabilities are thrown on the table

Under the dynamic flow control depth, xenos has listed "4 for loops/calls, 2^23 if nesting." What differentiates the two, and how exactly does this compare to the flat-out "24" of SM3.0? Is this a case of Xenos falling short of or exceeding SM3.0 spec?
 
TurnDragoZeroV2G said:
Well, they did move 20-25million transistors of logic to the daughter die (and there's only half as many ROPs, which go for 4 samples but ditch programmable pattern for fixed 4-sample pattern, too, right?). And doesn't Xenos kinda lack a need for anything resembling Avivo, which would allow for some savings there?
Yep. Even counting all that, Xenos seems to have a "low" transistor count. And not to forget it actually has 64 shader pipes (any 16 from the 64 are given up for redundancy to improve yield).

I think the "simpler" shader ALU organisation of Xenos is prolly a big part of it. Not only does that cut out the ALU (which is supposedly just capable of ADD) but it also cuts out a heap of complex issue/decode circuitry that has to work out if the ADD can be dual-issued.

Additionally, Xenos saves transistors over conventional PC GPUs by lumping batches into 16-wide phases. The batch size in R520 is 16, but in four phases - on each phase a quad of pixels is processed (all running the same instruction). On Xenos a batch size is 64 pixels, again in four phases. By making the phases wider like this, you use fewer transistors on the instruction fetch/issue/decode block - since you now have one of these blocks for each 16 pipes, whereas in PC GPUs you have one of these blocks for each 4 pipes. So Xenos has the same number of these blocks as R520 does - but Xenos has four times as many pipes (ignoring R520's vertex pipes for a second). That's a big transistor saving over what you might expect.
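
The block-count argument above can be sketched in a few lines. The pipe counts (16 pixel pipes for R520, 64 physical pipes for Xenos) and the four-phase batches are taken from this thread; the function names are made up for illustration, and nothing here models the real hardware.

```python
PHASES_PER_BATCH = 4  # both designs run a batch in four phases, per the post


def batch_size(phase_width: int) -> int:
    """Pixels per batch when each phase processes phase_width pixels."""
    return phase_width * PHASES_PER_BATCH


def issue_blocks(pipes: int, phase_width: int) -> int:
    """Number of fetch/issue/decode blocks if one block drives phase_width pipes."""
    if pipes % phase_width != 0:
        raise ValueError("pipe count must be a multiple of the phase width")
    return pipes // phase_width


if __name__ == "__main__":
    # (name, pipes, phase width) as described in the thread
    for name, pipes, width in (("R520", 16, 4), ("Xenos", 64, 16)):
        print(name, "batch:", batch_size(width), "blocks:", issue_blocks(pipes, width))
```

Both chips come out at four blocks, with R520 on 16-pixel batches and Xenos on 64-pixel batches, which is the trade being described: the same control logic spread over four times the pipes, at the price of coarser batches.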

Similarly, Xenos's texture pipes are treated as 16-wide, instead of four 4-wide quads. This means Xenos has one quarter of the texture-pipe decode logic (though I imagine that it's nowhere near as complex as the fetch/issue/decode logic required in the shader pipeline).

Xenos is supposed to work on batches of 8x8 pixels. But, it seems like it could actually batch up 2x2 pixels from any triangles in the scene, potentially, into the same batch.
Dave's article was written before it was known Xenos uses 64-sized batches.

The R520 article's picture illustrating dynamic branching implies that all the pixels in the batch are adjacent to each other and that the smaller batch size is what allows it to achieve greater efficiency. But if Xenos is batching up pixel quads from multiple triangles, wouldn't this be the equivalent of making some batches out of a few 2x2 blocks in the shadow, some in the "grey," and then some in the full light? In which case, DB efficiency would essentially go down the toilet?
Yes, Xenos will suffer lower DB efficiency than R520. R520 is a curiosity in this respect, because it's expected that all future ATI GPUs will increase the ALU:texture op ratio. In R520 it's 1:1. In R580 it's 3:1. So in R580, the batch size becomes 48, instead of R520's 16. So R580 (like Xenos) suffers a shortfall in DB efficiency compared with R520.
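
A toy model of why batch size matters for DB, as a sketch only. It assumes batches are adjacent square tiles, a hypothetical diagonal shadow edge across a square image, and that a tile is "divergent" (has to run both sides of the branch) whenever it contains pixels on both sides of the edge. None of this models real R520, R580 or Xenos behaviour, and the 48-pixel R580 batch is left out because it does not tile as a square.

```python
def divergent_fraction(size: int, tile: int) -> float:
    """Fraction of pixels sitting in tiles that straddle the shadow edge.

    size: width/height of the square image in pixels
    tile: width/height of one square batch (batch size is tile * tile)
    """
    if size % tile != 0:
        raise ValueError("image size must be a multiple of the tile size")

    def lit(x: int, y: int) -> bool:
        # Hypothetical diagonal shadow edge: pixels above it are lit.
        return x + y < size

    wasted_pixels = 0
    for ty in range(0, size, tile):
        for tx in range(0, size, tile):
            states = {
                lit(x, y)
                for y in range(ty, ty + tile)
                for x in range(tx, tx + tile)
            }
            if len(states) == 2:  # both lit and shadowed pixels in this batch
                wasted_pixels += tile * tile
    return wasted_pixels / (size * size)


if __name__ == "__main__":
    for tile in (2, 4, 8):
        print(f"batch of {tile * tile:2d} pixels:", divergent_fraction(64, tile))
```

On a 64x64 image the divergent share works out to 1/32 for 4-pixel batches, 1/16 for 16-pixel batches and 1/8 for 64-pixel batches: in this simple setup, doubling the tile edge doubles the share of pixels stuck running both branches.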

Is my understanding of this batching totally off, or is there a lot more logic that goes into properly batching things together (or is it that all the pixels in the 64-pixel batch are adjacent to one another, even if they lie on multiple triangles... :???: )
A batch is formed of pixels that all have the same shader state. A shader state is defined by the need to run the same shader program. As far as I can tell, in ATI hardware this means the pixels must all come from the same triangle.

In vertex processing, the shader state effectively relates to vertex batches - all the vertices must be in the same batch. Since vertex batches are normally hundreds to tens of thousands (or more) in size, that's not a problem.

But it's arguable that dynamic branching in vertex shader programs is going to suffer a lot from the inefficiency of running in batches of 64. On the other hand the tessellation (creation, destruction or shifting) of vertices that Xenos supports may make this moot.

From the XFest documents I have Xenos has extra tricks up its sleeve to do with dynamic branching. These are instructions that allow the dev to program the sequencer (the control block in Xenos that organises shader execution at the batch level). A simple example might be to jump over portions of code in the shader, or to loop over a portion of code - doing so for all 64 objects in the batch, if they all match the same condition (or, if any one matches a condition). It's all a bit hairy, to be honest - I haven't worked out how that would be used :oops: That's where real dev comments are going to be needed. Huge chunks of this document go right over my head.

Jawed
 
Well, that was what I was basically curious about. If that part in Dave's article still holds true, then it would appear there's no guarantee that the batch will consist of adjacent pixels. Taking the shadow example again, if the pixels in the shadow, partial shade, etc. all execute the same shader (with dynamic branching), then they're all potential choices for batches. In which case, the batcher might take 6 2x2 quads that are mostly in the partial shade, and 10 2x2 quads completely in the complete shadow. And now they all have to execute both/all the branches anyway. And if enough batches are created in this manner, then it's completely inefficient. But, I'm probably just not thinking properly.
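
The worry in that example can be put into a small sketch. The assumption is that every quad in a batch pays for every branch that any quad in the batch takes; the per-branch costs are invented numbers, and `batch_cost` / `total_cost` are hypothetical helpers, not anything from ATI documentation.

```python
# Hypothetical instruction counts per branch, invented purely for illustration.
BRANCH_COST = {"lit": 10, "partial": 40, "shadow": 2}


def batch_cost(quad_branches: list) -> int:
    """Cost of one batch: every distinct branch present is executed once
    for the whole batch, whether or not a given quad needs it."""
    return sum(BRANCH_COST[branch] for branch in set(quad_branches))


def total_cost(batches: list) -> int:
    """Sum of batch costs over a list of batches."""
    return sum(batch_cost(batch) for batch in batches)


if __name__ == "__main__":
    # 16 quads per batch, as in a 64-pixel batch of 2x2 quads.
    mixed = [["partial"] * 6 + ["shadow"] * 10, ["partial"] * 10 + ["shadow"] * 6]
    grouped = [["partial"] * 16, ["shadow"] * 16]
    print("mixed batches:  ", total_cost(mixed))
    print("grouped batches:", total_cost(grouped))
```

With these made-up costs the two mixed batches cost 84 in total against 42 for the same 32 quads grouped by branch, so how quads get grouped matters as much as the batch size itself.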

Though, I suppose the ideal situation would have a branch detected, and then pixel quads that take one branch to be rebatched with other pixels executing the same instruction/shader code. Meanwhile, the ones that take the opposing branch be rebatched the same way. And the other 64 available threads fill in this time while that data gets put into new batches/threads. But, that's just a round-about fix for having batches larger than 1 in the first place, so it wouldn't necessarily be needed later on down the line when somebody could put it in. Eh, just crazy-talk on my ignorant part, being bored and all at the moment. :???:

Still curious about the flow control, though.
 
Anand released a new article with more details about the Xbox 360 motherboard...

"Inside Microsoft's Xbox 360 - A Tour of the 360's Motherboard"
http://anandtech.com/systems/showdoc.aspx?i=2611

Also, here's another site that took apart the Xbox 360 and controller. Has pictures and discusses things like the video connector and the TSOP.

"The Soft Life"
http://softlife.blogspot.com/


BTW,

Anand mentions this...

Anand said:
Originally we assumed the chip below was a TV encoder, but we've since found out that the TV encoder on ATI's Xenos GPU is identical to what is on the ATI Radeon X1000 series of PC graphics cards - meaning the Xbox 360's TV encoder is located on the Xenos GPU itself and makes use of ATI's Xilleon display engine.

He later theorizes it might have something to do with Ethernet or an audio codec. Personally I believe it's the encoder from Microsoft's WebTV division. I thought all the Ethernet stuff was on the Southbridge?

I also find it kind of funny that Microsoft would include another TSOP. If it wasn't for the TSOP on the original Xbox, we probably wouldn't have seen the Xbox mod community get so big so fast.

Tommy McClain
 
AzBat said:
I also find it kind of funny that Microsoft would include another TSOP. If it wasn't for the TSOP on the original Xbox, we probably wouldn't have seen the Xbox mod community get so big so fast.

Funny indeed :devilish:
 
TurnDragoZeroV2G said:
And now they all have to execute both/all the branches anyway. And if enough batches are created in this manner, then it's completely inefficient. But, I'm probably just not thinking properly.
That's what it amounts to, yes.

But older GPUs with DB, i.e. NV40 and G70, are restricted to batch sizes in the hundreds and thousands of pixels!

G70 has a batch size in the region of 800 pixels - though I can't explain how that batch size is derived.

It's near-enough useless for any kind of shader that's attempting to perform per-pixel control flow.

Jawed
 
scificube said:
It's not looking good for DB usage at all now :(

If all the pixels in the batch are adjacent, (i.e., 8x8 square, rather than 16 2x2 squares from any triangles in the scene with the same state), then DB won't really be less efficient than what's already here in RV530 (which should be like R580, 4x12).

Which is probably the likelihood, I was just curious if the difference in working with triangles meant that Xenos had a more intelligent batching system that might change DB efficiency (I was looking at the extreme end, towards it being less efficient, but there's the opposite extreme as well).
 
jawed said:
From the XFest documents I have Xenos has extra tricks up its sleeve to do with dynamic branching. These are instructions that allow the dev to program the sequencer (the control block in Xenos that organises shader execution at the batch level). A simple example might be to jump over portions of code in the shader, or to loop over a portion of code - doing so for all 64 objects in the batch, if they all match the same condition (or, if any one matches a condition). It's all a bit hairy, to be honest - I haven't worked out how that would be used That's where real dev comments are going to be needed. Huge chunks of this document go right over my head.
Would it be possible to conclude that such tricks position Xenos' DB performance closer to R520 than previously speculated? Could it be that R520 offers just as much granularity in its batch sequencer, but that access to its functionality will be more the responsibility of ATI's driver team as opposed to devs?
 
I dunno, really. I haven't found any examples, and I'm not sure what kind of meaning you can attach to a random selection of 64 pixels.

R520 should be able to perform DB at the vertex level in vertex shader programs - something that Xenos can't do (since it operates on 64 vertices).

Sequencer programming is something you have to do in microcode. Seems like the kind of thing that's gonna take a while to get to grips with.

Jawed
 
Jawed said:
R520 should be able to perform DB at the vertex level in vertex shader programs - something that Xenos can't do (since it operates on 64 vertices).
Jawed
From what I heard, R520's vertex shader units still operate under a single SIMD issuing node, so I'm guessing dynamic branching is still an issue if only a small portion of vertices in a batch require DB. There may be branch prediction logic, though.
 
While I'd love to believe, and enjoy reading, Jawed's propaganda..

My issue is this: if ATI could pack 64 pipes into 230m transistors, why didn't they do it in desktops?

In fact, it's a ludicrous number, considering they have 16 in a 321m transistor R520.

Or for example, if the second ALU provides few benefits at great transistor cost, again, why is it in ATI desktop parts?
Doesn't make any sense..
And all this batch stuff is naturally over my head..

At this point, I'd much rather MS had waited until spring to launch, and used R580. That likely would have assured total superiority over PS3 if what we hear of R580 is true (48 pipes).

Or really, if they had just scrapped the EDRAM, they would have been fine. They used nearly 340 million transistors, which is a very large chip.

Or I think they could have gone with dual Xenos, and made up the cost any number of ways, for that matter. If analysts speculate each GPU costs $100, they could have raised the price $100, and still been at a respectable 399/499. After all, Blu-ray alone will likely cost Sony $100 initially. So they might have been at cost parity with Sony at worst with dual Xenos. Not to mention Cell is much more expensive than its Xbox counterpart. Hell, without the EDRAM, Xenos is a small chip anyway. They could have included two, and with the way they are doing things this go-around, they likely still would have been hugely better off cost-wise than they were with the first Xbox. After all, 65nm is right around the corner.

It remains to be seen, but I think MS botched the X360..and I think they had victory in all phases in their hands, too..
 
Bill said:
Or for example, if the second ALU provides few benefits at great transistor cost, again, why is it in ATI desktop parts?
Doesn't make any sense..

Perhaps this is the answer:
http://www.beyond3d.com/reviews/ati/r520/index.php?p=03
Although everything in the pipeline has been re-engineered to hit new target clocks that the 90nm process can enable and the capabilities have been extended for Pixel Shader 3.0 operation, the same ALU structure has been kept partially because ATI already have a highly optimised shader instruction compiler, which would need to be re-written for any different ALU organisation.
 
That's the explanation I think of, too.

But it makes one wonder why ATI wouldn't have been smart enough to build a better pipeline to start with.

Or why they wouldn't rewrite the compiler, if it would save them massive silicon costs.

Anyway, do we assume MS wrote the compiler for Xenos? And therefore the cost was on them?
 