View Full Version : ATI Xenos: XBOX 360 Graphics Demystified
Dave Baumann
12-Jun-2005, 23:37
<a href="http://www.beyond3d.com/articles/xenos/"><img border="1" src="http://www.beyond3d.com/siteimages/b3dsmall.gif" align="right" width="100" height="66"></a>Since Microsoft announced their next generation console platform, XBOX 360, at this year's E3, various aspects of the technology have been spoken of, including “Xenos” (formerly rumoured as “R500”), the graphics processor for the console. ATI have been presenting some information on it so far and, whilst one thing is clear (it is a very different graphics processor to those we have seen before), there has still been very little understanding of what Xenos really is and how it works.
We’ve spent some time with the lead architects of the graphics processor in order to gain a greater understanding of how Xenos works, and in this article we hope to lay that out in a little greater detail. This article explores some of the operations of the graphics system, such as how the eDRAM module fits into the graphics processing and how it handles resolutions greater than the eDRAM size. We also take a look at how the unified shader architecture at the heart of Xenos operates, look at why ATI have implemented such a system, and speculate on how this may affect upcoming PC graphics processors from ATI.
<a href="http://www.beyond3d.com/articles/xenos/">Read the full article here</a>.
Page 2, ATI C1 / Xenos, 1st paragraph truncated and duplicated.
[edit]
Page 2, Bandwidths and Interconnects, this section seems to be Page 3 instead, so should be removed from here.
Groovy!
Some suggested corrections for the first page:
handled in hardware adding, a T&L engine was a significant step up the OpenGL pipeline,
handled in hardware, adding a T&L engine was a significant step up the OpenGL pipeline,
up until up until the point that
up until the point that
"Both the pixel pipelines": "both" is redundant here
"with each vertex shaders enveloping the T&L processors entirely": agreement problem
couldn’t continue to exist and in the PC space and it certainly seems
couldn’t continue to exist in the PC space and it certainly seems
how we thing of the overall pipeline
how we think of the overall pipeline
performance assessments based "pipelines" alone
performance assessments based on "pipelines" alone
Bear in mind that we are under NDA for some of the operational details of the graphics processor to gain an understanding of how it differs from current platforms however some of the specifics are still under NDA and won't be revealed in full detail in this article
Bear in mind that we are under NDA for some of the operational details of the graphics processor to gain an understanding of how it differs from current platforms and so some specifics won't be revealed in full detail in this article.
Jawed
dizietsma
13-Jun-2005, 20:15
Dave, you seem to be rendering things twice in the review as mentioned above !
trinibwoy
13-Jun-2005, 20:17
Yeah, page 2 leads to page 3 which is a copy of page 2!!!
trinibwoy
13-Jun-2005, 20:23
Also on page 4 there seems to be a duplicated point:
Render to texture operations will also be written to the eDRAM and then passed back to system RAM for use as a texture when they are needed. The render to texture can occur with Multi-Sampling enabled and the developer can choose to resolve that down or keep it at the Multi-Sampled level when it is taken out of the eDRAM to be written to the UMA memory.
Render to texture operations will also be rendered out to the eDRAM first and then read out to UMA memory, when complete, in order to be used as a texture surface for the final frame rendering. Render to texture operations can also have Multi-Sample FSAA applied and the result can either be resolved on the way out to system memory or kept at the high resolution Multi-Sample level.
ALU is used but isn't defined.
The first time it is encountered there should be "ALU (Arithmetic Logic Unit)"
Page 7, 4th paragraph.
"Additional to the 48 ALU's is additional logic that performs all the pixel shader interpolation calculations"
That's poorly worded ^^, I suggest:
"Additionally to the 48 ALU's is dedicated logic that performs all the pixel shader interpolation calculations"
Page 7, 5th paragraph.
"As each of the three shader ALU arrays is a separate array there is no dependency between one another so what programs are being executed"
I suggest:
"Each of the three shader ALU arrays is independent, so what programs are being executed"
There seems to be confusion over whether there is lossless Z compression, on page 4:
as such this also means that Xenos does not need any lossless compression routines for Z or colour when writing to the eDRAM frame buffer
the pixels are then transferred to the daughter die in the form of source colour per pixel and loss-less compressed Z
Jawed
Ha! Told you should have waited Dave! :wink: Well, okay, I didn't actually say that but still... 8) Anyway, the proof-reading is very kind of you all (honestly) but I do believe some genuine congratulations need to be given to Dave for taking the time and sheer effort to produce such an article.
Thank you Dave, you did a great work.
There is so much to discuss there I don't know where to start, R500 packs tons of new stuff.
What thrills me most is the memexport mechanism; it really opens a whole new world of algorithms amenable to a GPU implementation.
Actually I retract the lossless-Z compression point as it seems one is talking about bandwidth usage within the EDRAM unit and the other is talking about bandwidth usage twixt parent and daughter dies.
Jawed
Ha! Told you should have waited Dave! :wink: Well, okay, I didn't actually say that but still... 8) Anyway, the proof-reading is very kind of you all (honestly) but I do believe some genuine congratulations need to be given to Dave for taking the time and sheer effort to produce such an article.
I believe our proof-reading IS our way to thank Dave for the time & effort he put into it.
But to make it clear : "Thanks Dave" :)
trinibwoy
13-Jun-2005, 20:34
Ha! Told you should have waited Dave! :wink: Well, okay, I didn't actually say that but still... 8) Anyway, the proof-reading is very kind of you all (honestly) but I do believe some genuine congratulations need to be given to Dave for taking the time and sheer effort to produce such an article.
I was keeping my praise to the end of the article. But you kinda have to post the faux pas immediately lest you forget.
But it has been an excellent read thus far and certainly answers several questions that have been bandied about the forums here. It does raise several questions too, as to the feasibility of some of these innovations making it to the PC space.
I like how he used strikethrough on "Pipeline" in the title on page 7. :lol: Talk about hammering the point home!
Ok can i get a 2 sentence summation for us dumb folks
You're going to want an XBox360 ! ;)
(At least I want to see one in action RIGHT NOW, after having read the article. [And I would love a XBox360 DevKit!])
Titanio
13-Jun-2005, 20:48
As posted in the locked thread: Good article. Good to have clarification on a number of points (the SIMD engines can be individually assigned to different workloads after all). Still questions remain, but this is the best article thus far on Xenos, IMO. Thanks Dave!
Shifty Geezer
13-Jun-2005, 20:54
Thank you Dave, you did a great work.
There is so much to discuss there I don't know where to start, R500 packs tons of new stuff.
C1 nAo, not R500! Didn't you read the article? :P
Many thanks Dave. This gives a great insight into what looks to be an excellent GPU. It seems damned well thought out and implemented, and presented like this, I can't see the point in conventional shaders. Adapting to both vertex and pixel shaders, the technology seems very portable too, for TV displays (without vertex work) and workstations (mostly vertex work): one size fits all, which should result in cheap production if marketed to those areas.
Is there any idea when we'll get the first real-world benchmarks? The first Xenos equipped XB360 dev kits are out there right now, no? Though I guess NDAs will prohibit insight into teething problems and successes :(
This chip only needs Fast14 tech to be perfect
/Runs and hides before the meteor storm/
Page 5:
Hierarchical Z buffers contain "courser" Z information
Hierarchical Z buffers contain "coarser" Z information
Jawed
The inefficiencies of current chips are what we really need insight into before we can deduce how much better '95% efficient' really is. Does seem like a good idea though - looking forward (I think) to unified shaders on the PC.
Thanks for the read Dave! 8)
I'll need to read it again before I can digest and fire questions! :P
I need to disappear now but until then, Xenos ~ 382 million transistors and at 500 MHz sounds like a beast! :P
PS.
I thought the PS3 was NUMA?
Alpha_Spartan
13-Jun-2005, 21:03
What a bangup job Dave. I mean, some of it was over my head (I'm new to the graphics circle), but overall I enjoyed the balanced tone. I await your analysis of the RSX architecture with pregnant anticipation.
trinibwoy
13-Jun-2005, 21:10
Thanks Dave. Although I don't think the barriers to entry for such a unified architecture on the PC are well understood (at least by me), I do agree with Shifty that on a principle/design level it sure beats what we have now.
Oh and on the last page - should that be trialing (though I doubt trial has a verb derivative).
Perhaps the answer lies in the fact that this is such a big change that trailing it in a closed box environment.
However the original semiconductor manufacturers are likely to still be in charge of further developments in terms of putting the cores on to smaller processes and we believe that this is part of the contract that ATI has with Microsoft. An obvious area for cost reduction of the Xenos processor is by merging the shader and daughter die on to a single core - we suspect that this will not happen until there is a process shrink available (that can also cater for both the complex logic and eDRAM) as two cores on 90nm mitigate some of the yield risks of a single, large die on 90nm.
couldn't they do the logic portion on 65nm and the edram on 90nm and still combine the two (Assuming the 65nm process doesn't support edram) and get cost reductions ?
trinibwoy
13-Jun-2005, 21:23
couldn't they do the logic portion on 65nm and the edram on 90nm and still combine the two (Assuming the 65nm process doesn't support edram) and get cost reductions ?
My guess is that it is probably easier to migrate the edram to 65nm than it is for the main chip. AFAIK (which isn't much :)) there's nothing special about edram transistors to make a die shrink incompatible.
couldn't they do the logic portion on 65nm and the edram on 90nm and still combine the two (Assuming the 65nm process doesn't support edram) and get cost reductions ?
My guess is that it is probably easier to migrate the edram to 65nm than it is for the main chip. AFAIK (which isn't much :)) there's nothing special about edram transistors to make a die shrink incompatible. I was under the impression it was the other way around. That edram is only available on certain processes and by certain fabs because of density?
chlovinka
13-Jun-2005, 21:39
couldn't they do the logic portion on 65nm and the edram on 90nm and still combine the two (Assuming the 65nm process doesn't support edram) and get cost reductions ?
My guess is that it is probably easier to migrate the edram to 65nm than it is for the main chip. AFAIK (which isn't much :)) there's nothing special about edram transistors to make a die shrink incompatible.
Wouldn't current leakage be a problem for edram @ 65nm as well?
Great article Dave, thanks a bunch!
Ok, on my mark, ready, set, C1 vs RSX. FIGHT!
[omg i'm just kidding!] :P
dukmahsik
13-Jun-2005, 21:41
xenos is a freakin beast, well thought out, future proof, and very efficient
Rockster
13-Jun-2005, 21:45
One thing I would like clarification on is the bus between the parent and daughter die. The leaked block diagram showed it as 32GB/s write and 16GB/sec read. Dave's diagram shows a single 32GB/s bi-directional interface. Which is correct?
Laa-Yosh
13-Jun-2005, 22:00
Fixed function tessellation sounds a bit of a letdown, although the details are obviously missing at this moment. Still, view-dependent tessellation sounds cool...
The GameMaster
13-Jun-2005, 22:00
Hrmm... I heard that the bandwidth between the parent GPU and the daughter GPU eDRAM module was 256GB/sec, though that might be wrong. I also heard the 32GB/sec (write) and 16GB/sec (read) bandwidth between the two modules... guess I am going to have to get additional clarification on that. Also the bandwidth between the CPU (XENON) and the GPU (XENOS) seems to be incorrect... it was stated that the bandwidth between the CPU and the GPU was FULL DUPLEX, in other words 21.6GB/sec in both directions at the same time, not 10.8GB/sec both directions at the same time, or effectively 43.2GB/sec in total bandwidth. This would also apply to the PS3 where it has 20GB/sec (write) and 15GB/sec (read) between the Cell and the RSX, or effectively 35GB/sec in total bandwidth.
XENOS itself is not WGF 2.0 compliant, but the XBox360 system is... XENOS lacks the ability to create vertices, but that ability was moved to the CPU (XENON).
This is a very good article regardless and has provided me with some new information to think about.
So is the Xenos better or worse than you expected?
Titanio
13-Jun-2005, 22:10
it was stated that the bandwidth between the CPU and the GPU was FULL DUPLEX, in other words 21.6GB/sec in both directions at the same time, not 10.8GB/sec both directions at the same time
I think Dave got this right, if it was 21.6GB/s both ways, they'd list 43.2GB/s as the total FSB. Full Duplex doesn't mean you double the bandwidth if the bandwidth given was the aggregate in the first place..
XENOS itself is not WGF 2.0 compliant, but the XBox360 system is... XENOS lacks the ability to create vertices, but that ability was moved to the CPU (XENON).
Creating vertices on the CPU doesn't necessarily make X360 as a whole 2.0 compliant. IIRC, WGF 2.0 hasn't even been finalised yet, so being compliant, technically isn't possible. X360 may be close in some aspects to the final 2.0 spec however.
trinibwoy
13-Jun-2005, 22:10
couldn't they do the logic portion on 65nm and the edram on 90nm and still combine the two (Assuming the 65nm process doesn't support edram) and get cost reductions ?
My guess is that it is probably easier to migrate the edram to 65nm than it is for the main chip. AFAIK (which isn't much :)) there's nothing special about edram transistors to make a die shrink incompatible. I was under the impression it was the other way around. That edram is only available on certain processes and by certain fabs because of density?
Ah good point. I guess it's to be seen how that's handled at 65nm.
Dave, Ken Kutaragi has a question :P
The vertex shader and pixel shader are unified in ATI's architecture, and it looks good at one glance, but I think it will have some difficulties. For example, some question where will the results from the vertex processing be placed, and how will it be sent to the shader for pixel processing. If one point gets clogged, everything is going to get stalled. Reality is different from what's painted on canvas. If we're taking a realistic look at efficiency, I think Nvidia's approach is superior.
GameSpot (http://www.gamespot.com/news/2005/06/13/news_6127392.html)
And great job! Congratulations!
The GameMaster
13-Jun-2005, 22:21
it was stated that the bandwidth between the CPU and the GPU was FULL DUPLEX, in other words 21.6GB/sec in both directions at the same time, not 10.8GB/sec both directions at the same time
I think Dave got this right, if it was 21.6GB/s both ways, they'd list 43.2GB/s as the total FSB. Full Duplex doesn't mean you double the bandwidth if the bandwidth given was the aggregate in the first place...
I know, it was the fact that certain people were going around saying that the PS3 Cell CPU has 35GB/sec bandwidth to the RSX, which is false as it is 20GB/sec one way and 15GB/sec the other way at the same time, but not 35GB/sec one way. In any case there was more than one interview with ATI that stated that 21.6GB/sec was full duplex or 21.6GB/sec in both directions at the same time or 21.6GB/sec (read) and 21.6GB/sec (write) at the same time, depending on how you read it. You are right it is *NOT* 43.2GB/sec in aggregate bandwidth... it is still 21.6GB/sec. The problem is that Dave listed the FSB as 10.8GB/sec (read) and 10.8GB/sec (write) in both directions at the same time. In any case here is a link to one such interview, this one by HardOCP....
Now between the GPU and the CPU, things get a bit fuzzier. And by “fuzzier,” I mean that they would not tell me much about it at all. The bus between the CPU and GPU was characterized as unique and proprietary. Mr. Feldstein did let on that the bus could shuttle up to 22 Gigabytes of data per second. Much like GDDR3, this would be a full duplex bus, or one that “goes both ways” at one time. Beyond that, not much was shared.
Regardless, some of the points that were brought up in Dave's article will require me to double check some of my information and clarify some of these points. A very good article regardless.
Gamemaster what exactly *is* your information?
1 :?: ,
Does Xenos use virtual memory :?:
It could save a lot of BW...
AlStrong
13-Jun-2005, 22:30
why would it need virtual mem if it has access to the entire 512MB? :P
Titanio
13-Jun-2005, 22:32
it was stated that the bandwidth between the CPU and the GPU was FULL DUPLEX, in other words 21.6GB/sec in both directions at the same time, not 10.8GB/sec both directions at the same time
I think Dave got this right, if it was 21.6GB/s both ways, they'd list 43.2GB/s as the total FSB. Full Duplex doesn't mean you double the bandwidth if the bandwidth given was the aggregate in the first place...
I know, it was the fact that certain people was going around saying that the PS3 Cell CPU has 35GB/sec bandwidth to the RSX, which is false as it is 20GB/sec one way and 15GB/sec the other way at the same time, but not 35GB/sec one way.
Correct, it's 35GB/s aggregate, asymmetrical.
In any case there was more than one interview with ATI that stated that 21.6GB/sec was full duplex or 21.6GB/sec in both directions at the same time or 21.6GB/sec (read) and 21.6GB/sec (write) at the same time, depending on how you read it. You are right it is *NOT* 43.2GB/sec in aggregate bandwidth... it is still 21.6GB/sec.
21.6GB/s aggregate suggests it is that which is split - 10.8 both ways, symmetrical. I'm guessing 21.6GB/s is the aggregate figure, but I'm sure Dave knows, and if it was otherwise he'd have said so.
In any case here is a link to one of the such interviews, this one by HardOCP....
Now between the GPU and the CPU, things get a bit fuzzier. And by “fuzzier,” I mean that they would not tell me much about it at all. The bus between the CPU and GPU was characterized as unique and proprietary. Mr. Feldstein did let on that the bus could shuttle up to 22 Gigabytes of data per second. Much like GDDR3, this would be a full duplex bus, or one that “goes both ways” at one time. Beyond that, not much was shared.
This doesn't suggest it's ~22 both ways. Just that the bus goes both ways, as any duplex bus does. If it was ~22 both ways, they'd be claiming ~44 total bandwidth.
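The disagreement above comes down to whether a quoted figure is per direction or the sum of both directions. A minimal sketch of that arithmetic, using only the numbers quoted in the thread (the function name is made up for illustration, and which reading ATI intended is not settled here):

```python
def aggregate_bandwidth(read_gb_s, write_gb_s):
    """Total bandwidth of a full-duplex link: both directions added together."""
    return read_gb_s + write_gb_s

# Reading 1 (Dave's diagram, per the thread): 10.8 GB/s each way, symmetrical.
xenon_xenos_split = aggregate_bandwidth(10.8, 10.8)      # about 21.6 GB/s total

# Reading 2 (The GameMaster's): 21.6 GB/s each way, which would total about 43.2.
xenon_xenos_each_way = aggregate_bandwidth(21.6, 21.6)

# Cell <-> RSX as quoted in the thread: 20 GB/s one way, 15 GB/s the other.
cell_rsx = aggregate_bandwidth(15.0, 20.0)               # 35 GB/s total, asymmetrical

print(xenon_xenos_split, xenon_xenos_each_way, cell_rsx)
```

Under reading 1 the advertised 21.6GB/s is already the aggregate, which is Titanio's point: "full duplex" describes the direction of traffic, not a doubling of the quoted number.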
So, my favourite things:
1. Alpha to Mask - means there's no need for supersampling to get anti-aliased alpha-blended textures - I think!
2. Ultra-fast Z-only pass - should massively reduce overdraw - wow!
3. MEMEXPORT - arbitrary tessellation algorithms - I think!
Things I'm a bit disappointed with:
1. 16-way SIMD ALU arrays - obviously it's a compromise based upon the overhead associated with the complex arbitration and scheduling of threads. I worry that there'll be a fair amount of wastage with small triangles and there'll also be a fair amount of wastage in per-pixel dynamically branched code.
2. maximum 4xAA - in my brief experience with MSAA (Radeon 9800 Pro only lasted 5 or 6 months) 4xAA hardly seems better than no AA to me - movement causes an awful lot of sparkles.
Anyway, that's just geek interest, cos I have no intention of actually buying a console...
Jawed
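The small-triangle worry in the post above can be put into numbers with a deliberately simple model. This sketch assumes that a 16-wide vector only ever holds pixels from a single triangle, which is an assumption made for illustration and not something the article or the thread confirms; `lane_utilization` is a hypothetical helper.

```python
import math

def lane_utilization(pixels_in_triangle, simd_width=16):
    """Fraction of SIMD lanes doing useful work, assuming one triangle per vector."""
    if pixels_in_triangle <= 0:
        raise ValueError("triangle must cover at least one pixel")
    vectors_needed = math.ceil(pixels_in_triangle / simd_width)
    return pixels_in_triangle / (vectors_needed * simd_width)

for pixels in (4, 16, 17, 64):
    print(pixels, lane_utilization(pixels))
```

Under this model a 4-pixel triangle keeps only a quarter of the lanes busy, while a 17-pixel triangle spills into a second, almost empty vector. If the hardware can pack several triangles into one vector, the real wastage would be lower.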
Thowllly
13-Jun-2005, 22:56
on page 4
The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an available range of -32.0 to 32.0.
If the range really is -32 to 32 then it can't be 3b exponent, 7b mantissa, you need a sign bit too: 1b sign, 3b exponent & 6b mantissa.
sonix666
13-Jun-2005, 22:58
on page 4
The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an available range of -32.0 to 32.0.
If the range really is -32 to 32 then it can't be 3b exponent, 7b mantissa, you need a sign bit too: 1b sign, 3b exponent & 6b mantissa.
That's true, but because the mantissa is always normalised (meaning that the most significant bit is always 1) the most significant bit is not stored. So in essence it can still be 3 bit exponent, 7 bit mantissa and 1 bit sign. Oh, the joys and horrors of expressing FP formats! ;)
Rockster
13-Jun-2005, 23:00
Can Z and Color depths be different?
Thowllly
13-Jun-2005, 23:08
on page 4
The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an available range of -32.0 to 32.0.
If the range really is -32 to 32 then it can't be 3b exponent, 7b mantissa, you need a sign bit too: 1b sign, 3b exponent & 6b mantissa.
That's true, but because the mantissa is always normalised (meaning that the most significant bit is always 1) the most significant bit is not stored. So in essence it can still be 3 bit exponent, 7 bit mantissa and 1 bit sign. Oh, the joys and horrors of expressing FP formats! ;)
I know, but that extra bit is usually not included when giving the number of bits for the mantissa.
Also, on page 5
1080i equates to 1920x1080 pixels, however interlacing means that only the odd horizontal lines are refreshed on one cycle and the even lines on the next, which means that the frame buffer is only ever needing to handle 1920x540 pixels per refresh.
That is called field rendering and can only be used if the game always runs at a constant 60fps. Also, since the in-between lines are not rendered, the image cannot be flicker-filtered, meaning lots of flicker on CRTs. I would be somewhat surprised if Xbox360 games actually used field rendering...
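For reference, the pixel counts behind the field-rendering point work out as follows. This is plain arithmetic on the resolutions quoted above; the helper name is invented for the example and nothing here says what Xbox 360 titles actually do.

```python
def pixels_per_refresh(width, height, interlaced):
    """Pixels drawn per refresh: a field (half the lines) if interlaced, else a full frame."""
    lines = height // 2 if interlaced else height
    return width * lines

full_frame = pixels_per_refresh(1920, 1080, interlaced=False)  # 2,073,600 pixels
one_field = pixels_per_refresh(1920, 1080, interlaced=True)    # 1,036,800 pixels

print(full_frame, one_field)
```

Field rendering halves the pixels handled per refresh, but as noted above it only works if every refresh produces a new field on time.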
blakjedi
13-Jun-2005, 23:15
EDIT: Every use of whilst should be while.
Memexport is very sexy and makes me think that the 192 processors on the daughter die can be the equivalent of a mini PPU
Blak I have a feeling you speak a different dialect of English than does Dave. :wink:
than Dave does* my friend...
than Dave does..
Just a little survey: Does anyone think that the 'r'-word (revolutionary) could be applied to this architecture? :D
than Dave does* my friend...
than Dave does..
Hey Coola that's just how I write (and talk) - you just have to deal with someone who's used to a lot of legalese. 8)
(but no I'm not a lawyer)
Just a little survey: Does anyone think that the 'r'-word (revolutionary) could be applied to this architecture? :D
Not unless there is a gyroscope and a microphone involved! :lol:
Titanio
13-Jun-2005, 23:39
Memexport is very sexy and makes me think that the 192 processors on the daughter die can be the equivalent of a mini PPU
A Physics Processing Unit? :lol:
Someone can correct me if I'm wrong here, but the logic on the daughter die is very specialised. It wouldn't be of much use beyond what it is designed for, and there isn't exactly a tonne of it. Not to mention it'll be busy as is, can only work on data in the eDRAM (whose space will be needed for the framebuffer), and I'm not sure if the daughter die logic can use memexport anyway (?)
Did I understand it correctly that all z-units, including those for MSAA, can be used when no color/MSAA is used for the pre-z pass to write 64 zixels/cycle?
Better AF -> is it programmable?
I assume fp texture filtering is supported; it wasn't really mentioned.
can everyone give a Grade A to F of what you think of Xenos, A being revolutionary, F being useless
Did i understand it correctly that all z-units including those for MSAA can be used when no color/MSAA is used for the pre-z pass to write 64zixels/cycle?
Yep, we've known this for some time now...
Jawed
PeterAce
14-Jun-2005, 00:54
Fantastic article Dave.
It answered many of the questions (in my mind from the leak/launch).
PeterAce
14-Jun-2005, 00:58
I wonder if the 'Hierarchical Stencil Buffer' will make it on to a PC Desktop VPU before the switch to this new type of architecture?
I know Dave mentioned 64 z operation with 4xMSAA one time but there was no real answer whether also 64 zixel could be written out without MSAA.
EDIT:
So after reading it again, is it right that Xenos can't output more than 16 zixels/cycle but can do up to 64 Z operations with 4xMSAA?
Xenos has 8GZixel fillrate @500MHZ is that correct?
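The 8 GZixel figure asked about above follows directly from the numbers in the post. A quick check, taking the 500 MHz clock and the per-clock rates as stated in the thread rather than as confirmed specifications:

```python
CLOCK_HZ = 500_000_000  # 500 MHz, as quoted in the thread

def z_rate_per_second(z_per_clock, clock_hz=CLOCK_HZ):
    """Z values handled per second for a given per-clock rate."""
    return z_per_clock * clock_hz

print(z_rate_per_second(16))  # 8_000_000_000: 16 zixels/clock gives 8 G per second
print(z_rate_per_second(64))  # 32_000_000_000: 64 Z operations/clock (the 4xMSAA case)
```

So 16 per clock matches the 8 GZixel/s in the question, and the 64-per-clock case would be four times that, counted in Z samples rather than output pixels.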
Dave - first of all thanks for the great write-up. I have a couple questions:
1.) Does a single group (as operated on by the SIMD ALUs) contain only 16 pixels/verts as opposed to the 64 indicated by the leak?
2.) Can you expand on the nature of the hierarchical buffers? E.g. what sort of attributes are tracked in hierarchical fashion (stencil parity, actual stencil value, z min, z max)? Are hierarchical Z and stencil integrated - that is, if HZ rejects a tile of pixels, can a stencil value for the tile be modified as a result?
If HZ can reject/accept up to 64 pixels per clock (and perform stencil updates on a per-tile basis) I figure stencil shadow algorithms will speed up significantly versus just double-rate z/stencil ROPs...
on page 4
The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an available range of -32.0 to 32.0.
If the range really is -32 to 32 then it can't be 3b exponent, 7b mantissa, you need a sign bit too: 1b sign, 3b exponent & 6b mantissa.
No sign bit.
What would negative values in a frame buffer mean anyway?
I know Dave mentioned 64 z operation with 4xMSAA one time but there was no real answer whether also 64 zixel could be written out without MSAA.
It doesn't matter, if you have to have MSAA on, you lie about the size of the rendertarget, turn on MSAA, then turn it back off for the color pass.
on page 4
The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an available range of -32.0 to 32.0.
If the range really is -32 to 32 then it can't be 3b exponent, 7b mantissa, you need a sign bit too: 1b sign, 3b exponent & 6b mantissa.
No sign bit.
What would negative values in a frame buffer mean anyway?
When you're writing to a render target negative values could have meaning though, couldn't they?
But anyway, an offset would obviate the need for a sign bit, wouldn't it?
Jawed
I know Dave mentioned 64 z operation with 4xMSAA one time but there was no real answer whether also 64 zixel could be written out without MSAA.
It doesn't matter, if you have to have MSAA on, you lie about the size of the rendertarget, turn on MSAA, then turn it back off for the color pass.
But wouldn't that create a 4:1 mapping in the HZ buffer - which itself is a 4:1 mapping anyway? Which would lead to a huge loss of resolution in re-using the HZ buffer for the colour pass.
Jawed
talyn99
14-Jun-2005, 01:46
It's garbage.........this where they doin the poal right
I know Dave mentioned 64 z operation with 4xMSAA one time but there was no real answer whether also 64 zixel could be written out without MSAA.
It doesn't matter, if you have to have MSAA on, you lie about the size of the rendertarget, turn on MSAA, then turn it back off for the color pass.
But wouldn't that create a 4:1 mapping in the HZ buffer - which itself is a 4:1 mapping anyway? Which would lead to a huge loss of resolution in re-using the HZ buffer for the colour pass.
Jawed
I really don't know how the hierarchical Z works with respect to multisampling; worst case its blocks are larger, which would allow faster rejection at the cost of fewer rejections at the HZ level.
on page 4
The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an available range of -32.0 to 32.0.
If the range really is -32 to 32 then it can't be 3b exponent, 7b mantissa, you need a sign bit too: 1b sign, 3b exponent & 6b mantissa.
No sign bit.
What would negative values in a frame buffer mean anyway?
When you're writing to a render target negative values could have meaning though, couldn't they?
But anyway, an offset would obviate the need for a sign bit, wouldn't it?
Jawed
If you're writing something other than colors into the rendertarget, then 10 10 10 2 is probably the wrong format.
Yes, you could use an offset and lose basically one bit of precision, but the format is designed specifically to allow HDR rendering and there is no reason to have negative values in that context.
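To make the bit-counting argument concrete, here is a rough decoder for an unsigned 10-bit float with a 3-bit exponent and 7 stored mantissa bits, the "no sign bit" layout argued for above. The exponent bias of 3, the hidden leading 1 and the denormal handling are assumptions chosen so that the top of the range lands just under 32; none of them is stated in the article or the thread, and `decode_unsigned_e3m7` is a made-up name.

```python
def decode_unsigned_e3m7(bits, bias=3):
    """Decode 10 bits laid out as [3-bit exponent | 7-bit mantissa], no sign bit."""
    if not 0 <= bits < (1 << 10):
        raise ValueError("expected a 10-bit value")
    exponent = bits >> 7        # top 3 bits
    mantissa = bits & 0x7F      # low 7 bits
    if exponent == 0:
        # Assumed denormal case: no hidden bit, so zero is representable.
        return (mantissa / 128.0) * 2.0 ** (1 - bias)
    # Normalised case: the leading 1 is implied and not stored.
    return (1.0 + mantissa / 128.0) * 2.0 ** (exponent - bias)

print(decode_unsigned_e3m7(0))          # smallest value
print(decode_unsigned_e3m7(3 << 7))     # exponent equal to the bias, mantissa 0
print(decode_unsigned_e3m7(0x3FF))      # largest value, just below 32
```

With a sign bit the same 10 bits would have to be split 1 + 3 + 6, which is Thowllly's objection; dropping the sign, as ERP describes, leaves 3 + 7 and a range starting at zero.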
There are 48 ALUs, and there are 32 texture processors (16 filtering and 16 point sample), at least at face value.
It seems to me that at any one time only 16 texture processors can be operational as Xenos can process 64 threads, in total, simultaneously.
So, why is half the texturing capacity of Xenos idle at any one time? Are there actually 32 independent texture processors, or are they shared in some way between filtering and point sampling, meaning that only 16 can be operational at any one time?
So, is Xenos executing 64 threads concurrently (3 shader arrays of 16-way SIMD + 16 texture operations)? Or is it 80 threads (i.e. including the extra 16 texture operations)?
Jawed
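The 64-versus-80 question in the post above is a matter of what gets counted. A small tally using only the figures Jawed quotes (3 arrays of 16-way SIMD, plus 16 or 32 texture processors); whether all 32 texture processors can really be active at once is exactly what the post leaves open.

```python
SHADER_ARRAYS = 3
SIMD_WIDTH = 16
ALU_LANES = SHADER_ARRAYS * SIMD_WIDTH  # 48 ALUs in total

def threads_in_flight(active_texture_processors):
    """ALU lanes plus however many texture processors are assumed to be busy."""
    return ALU_LANES + active_texture_processors

print(threads_in_flight(16))  # 64: only one set of 16 texture processors active
print(threads_in_flight(32))  # 80: filtering and point-sampling units both active
```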
Megadrive1988
14-Jun-2005, 02:06
I've read about half of the article so far. nice job Dave. I'll be reading the rest of it tonight, over coffee 8)
I am surprised that the whole article is 18 or 19 pages. I thought it would be cut down to maybe 7-9 pages. glad that there is more content than I was expecting.
Reverend
14-Jun-2005, 02:12
Thanks Dave, cleared up some confusions of mine.
So is this article the start of something new at Beyond3D? Or only if/when console graphics chip makers are willingly helpful? What about one on PS3?
DemoCoder
14-Jun-2005, 02:26
Dave,
Does MEMEXPORT kinda work like a 'phase' marker in the shader? In that all shader code before MEMEXPORT, for every pixel, must be finished before the first instruction after MEMEXPORT can continue? Or is MEMEXPORT always the final instruction, like KILL? If you have some threads writing values A,B,C to memory, and some other threads reading A,B,C, and those reading threads can execute random access reads, there is no way to make this work without synchronization.
Otherwise, I don't see how it could work, because it is impossible to know when a given thread will finish and the value will be available for randomized(!) reading by another shader.
Is it possible that MEMEXPORT is an operation that just short-circuits sending shader results to the ROP/eDRAM chip, and instead just dumps those results to system RAM? It would still be like a render-to-texture operation, only you're not writing pixels, you're writing shader register values (streaming)
I'd like more explanation on how this is supposed to work, and how reads and writes are synced. I think people have some notion that MEMEXPORT functions like you'd expect memmove instructions on a typical single threaded CPU, with random reads and writes. But given the concurrent threaded way in which GPUs execute shader code, there's not really any way for a GPU to write value X in shader thread T to memory location L, and have another shader thread T' read that value immediately.
Also, any information on Geometry Shaders? Do they exist in Xenos?
Trawler
14-Jun-2005, 02:31
Thanks Dave, cleared up some confusions of mine.
So is this article the start of something new at Beyond3D? Or only if/when console graphics chip makers are willingly helpful? What about one on PS3?
I'm sure Dave will have a good look at RSX or its desktop counterpart. From the sound of things it's a much more traditional approach so, from the view of a person who is interested in 3D technology, doesn't warrant as much attention as Xenos.
Excellent article Dave. Xenos is, IMHO, far more interesting than Cell from a technical standpoint. Kudos to ATI.
DemoCoder
14-Jun-2005, 02:43
Apples to Oranges. Xenos is a GPU. CELL is a CPU. From a CPU standpoint, CELL is non-traditional, and therefore very interesting. Hell, these days, anything that is different than the P4/A64 is interesting. How many CPUs do you know that have 8 SIMD DSPs with super-fast SRAM?
RSX certainly won't be as interesting.
if anyone has read all of it, can anyone summarize the details and innovations and amazing and ok stuff in points format for my website? thanks
quest55720
14-Jun-2005, 03:22
Thank you for the article dave. Even though I did not understand half of it I had a good time reading it.
if anyone has read all of it, can anyone summarize the details and innovations and amazing and ok stuff in points format for my website? thanks
Jump to your own conclusions... like the rest of us. :wink:
ERP - from Dave's article it seems the HZ works on a sample basis:
In Xenos's case the Hierarchical Z Buffer stores down to 16 sample groups, which equates to 2x2 pixel groupings with 4x FSAA enabled
Also wouldn't your trick result in different sample locations for Z evaluation between the Z and color passes? Or are you saying that Xenos can be set up to use OGMS :) ?
overclocked_enthusiasm
14-Jun-2005, 03:29
Great article Dave. The article is dated June 13, 2004 instead of 2005.
blakjedi
14-Jun-2005, 03:30
Memexport is very sexy and makes me think that the 192 processors on the daughter die can be the equivalent of a mini PPU
A Physics Processing Unit? :lol:
Someone can correct me if I'm wrong here, but the logic on the daughter die is very specialised. It wouldn't be of much use beyond what it is designed for, and there isn't exactly a tonne of it. Not to mention it'll be busy as is, can only work on data in the eDram (whose space will be needed for the framebuffer), and I'm not sure if the daughter die logic can use memexport anyway (?)
"basically any operation that can be mapped to a wide SIMD array can be fairly efficiently achieved and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor and is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU."
ERP - from Dave's article it seems the HZ works on a sample basis:
In Xenos's case the Hierarchical Z Buffer stores down to 16 sample groups, which equates to 2x2 pixel groupings with 4x FSAA enabled
Also wouldn't your trick result in different sample locations for Z evaluation between the Z and color passes? Or are you saying that Xenos can be set up to use OGMS :) ?
Sure it would, but if I'm only using Z to reject pixels then I can use a small offset (or probably quite a large one) with no significant issue.
If I'm using it for a depth map then the actual pixel offsets are probably not an issue.
Megadrive1988
14-Jun-2005, 03:42
From a CPU standpoint, CELL is non-traditional, and therefore very interesting. Hell, these days, anything that is different than the P4/A64 is interesting. How many CPUs do you know that have 8 SIMD DSPs with super-fast SRAM?
RSX certainly won't be as interesting.
good point DemoCoder, agreed.
Xenos is extremely interesting with its unified shader architecture and eDRAM and split die approach.
Cell is interesting as it is a complete departure from traditional CPUs, other than the fact that it has a PowerPC core.
ralexand
14-Jun-2005, 04:45
Amazing job Dave. Took me all day to read which is awesome.
Would be nice to have a diagram of a traditional pipeline vs. a c1 "pipeline"? I'm still confused on that.
Megadrive1988
14-Jun-2005, 06:28
finally finished reading the whole thing. hats off to Dave for pulling off a truly excellent article. it answered many questions, and for me, raises even more questions. can't think of what else I was going to say or ask. I'll no doubt re-read the entire article again tomorrow.
it seems that Xenos/C1 has some things that R600 will not have, and R600 will have some things that Xenos/C1 doesn't have. <--- that was one of the things I wanted to say.
Mulciber
14-Jun-2005, 06:29
Likewise, a unified pipeline also increases efficiency by removing cases where the vertex shader is idle, waiting for the pixel shader to have available slots, or the pixel shader is idle, waiting for the vertex shader to produce data. If such efficiencies are fully realised in relation to current graphics processing methodologies, this can result in either a smaller chip (hence cheaper) with the same performance as larger chips with a traditional architecture, or the same sized chip with more ALU's dedicated to processing, hence higher performance.
Does this conclusion take into account the transistors added for all the extra control logic in the C1? Also, while it is true that graphics chips in PCs are being used inefficiently, this shouldn't be the case with a closed platform such as a game console.
anyone have a laymans term summary please??
anyone have a laymans term summary please??
Look at the Inquirer in a couple of days :-)
Oh, btw, great job Dave. Ati has a very interesting architecture in their hands now. Too bad that we won't see it in the PC space for quite some time. Though I guess Ati has some good reasons for doing this.
anyone have a laymans term summary please??
Look at the Inquirer in a couple of days :-)
boooooooo
ralexand
14-Jun-2005, 07:59
So does the article answer this concern from KK?
For example, some question where will the results from the vertex processing be placed, and how will it be sent to the shader for pixel processing. If one point gets clogged, everything is going to get stalled.
snakejoe
14-Jun-2005, 08:07
So does the article answer this concern from KK?
For example, some question where will the results from the vertex processing be placed, and how will it be sent to the shader for pixel processing. If one point gets clogged, everything is going to get stalled.
here, maybe :?:
http://www.beyond3d.com/articles/xenos/index.php?p=08
The Xenos shader contains a large number of independent groups of pixels and vertices (threads) which are 16 wide. In order to hide the latency of an instruction for a given thread, a number of other threads are used to "fill in the gaps". By doing this, the ALU's are fully utilized all the time, and the shader can have direct data dependency on every instruction and still run full rate.
So you would need 15 megs of eDRAM to get 720p 4x FSAA for free? Or, I guess, really 14.1 megs. This would also allow you to fit a 1080i 4x FSAA frame into the eDRAM in 3 swaps, which would make it cost only 5%, correct?
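A rough sketch of the arithmetic behind these figures. It assumes 4 bytes of colour and 4 bytes of Z/stencil per sample (the per-sample sizes are an assumption here, not something stated in this thread) and a 10MB eDRAM; the function names are made up for illustration.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # 10MB of eDRAM on the daughter die


def framebuffer_bytes(width, height, samples, colour_bytes=4, z_bytes=4):
    """Size of a multisampled back buffer; bytes per sample are assumed values."""
    return width * height * samples * (colour_bytes + z_bytes)


def tiles_needed(width, height, samples, colour_bytes=4, z_bytes=4):
    """How many eDRAM-sized tiles the frame has to be split into."""
    size = framebuffer_bytes(width, height, samples, colour_bytes, z_bytes)
    return math.ceil(size / EDRAM_BYTES)


# Colour only, which is where the "14.1 megs" figure comes from.
colour_only = framebuffer_bytes(1280, 720, 4, z_bytes=0)
print(colour_only / 2**20)                       # about 14.1

# Colour plus Z, under the assumed 8 bytes per sample.
print(framebuffer_bytes(1280, 720, 4) / 2**20)   # about 28.1
print(tiles_needed(1280, 720, 4))
```

With Z included the 720p 4x FSAA buffer is roughly twice the colour-only figure, so under these assumptions it would not fit in 15 megs either.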
Titanio
14-Jun-2005, 08:50
Memexport is very sexy and makes me think that the 192 processors on the daughter die can be the equivalent of a mini PPU
A Physics Processing Unit? :lol:
Someone can correct me if I'm wrong here, but the logic on the daughter die is very specialised. It wouldn't be of much use beyond what it is designed for, and there isn't exactly a tonne of it. Not to mention it'll be busy as is, can only work on data in the eDram (whose space will be needed for the framebuffer), and I'm not sure if the daughter die logic can use memexport anyway (?)
"basically any operation that can be mapped to a wide SIMD array can be fairly efficiently achieved and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor and is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU."
You were talking about using the daughter die logic, not the shader array.
ralexand
14-Jun-2005, 08:54
So does the article answer this concern from KK?
For example, some question where will the results from the vertex processing be placed, and how will it be sent to the shader for pixel processing. If one point gets clogged, everything is going to get stalled.
here, maybe :?:
http://www.beyond3d.com/articles/xenos/index.php?p=08
The Xenos shader contains a large number of independent groups of pixels and vertices (threads) which are 16 wide. In order to hide the latency of an instruction for a given thread, a number of other threads are used to "fill in the gaps". By doing this, the ALU's are fully utilized all the time, and the shader can have direct data dependency on every instruction and still run full rate.
Thanks, I kind of thought that was it but I wasn't sure if KK was talking about the same situation.
To snakejoe: the quote you extracted from Dave's article doesn't answer KK's concerns.
KK is talking about thread scheduling on Xenos processors.
This is what Dave wrote about this fundamental (to the new Xenos architecture) problem:
ATI, probably understandably, weren't too keen on giving many details out in regards to the prioritisation methodology, probably because there is some fairly proprietary logic behind it, but also because for the most part you shouldn't need to know much about it other than "it happens". From ATI's comments it sounds like a fairly complicated procedure, but conceptually it appears to monitor the vertex buffer and pixel export buffer (just before the transfer to the daughter die) and, depending on application program mix, there is an equation that prioritises between pixel shading and vertex shading dependent on the size of the buffers and how full they are.
Well, frankly, who here thinks KK actually understands. All he's got is the NVidia version of the story, which is why NVidia has rejected complete unification. In other words, he's nothing more than a mouthpiece for FUD.
Jawed
Evildeus
14-Jun-2005, 09:25
Great article Dave.
Could you make a new chart that compare the capabilities of Xenos, R420 and NV40?
Well, frankly, who here thinks KK actually understands.
I think he understands, it doesn't take a rocket scientist to understand this stuff, moreover KK is an electronic engineer with a lot of hw design experience.
If we can grasp some Xenos details I've no doubts KK can understand Xenos too :wink:
Well, frankly, who here thinks KK actually understands.
I think he understands, it doesn't take a rocket scientist to understand this stuff, moreover KK is an electronic engineer with a lot of hw design experience.
If we can grasp some Xenos details I've no doubts KK can understand Xenos too :wink:
if there are some things we don't know because of nda i'm sure it's the same for him. So really he doesn't know how ati is going to handle this and nvidia is basically telling him they don't like unified shaders and that is what he is going with. If he had hands on experience with the xenos then I could believe him.
krychek
14-Jun-2005, 09:38
The first subheading is Forward should it not be Foreword?
Thanks for such a detailed article Dave. I haven't read it yet, but it would be nice to have a short summary crammed with details about the architecture which you can just refer to quickly - like a short case study. But of course, if everyone just goes to the summary there won't be much ad revenue :shock:
Dave, is that part:
One element that has been reported on is the number of 150M transistors in relation to the graphics processing elements of Xenon, however according to ATI this is not correct as the shader core itself is comprised of in the order of 232M transistors. It may be that the 150M transistor figure pertains only to the eDRAM module as with 10MB of DRAM, requiring one transistor per bit, 80M transistors will be dedicated to just the memory; when we add the memory control logic, Render Output Controllers (ROP's) and FSAA logic on top of that it may be conceivable to see an extra 70M transistors of logic in the eDRAM module.
speculation or fact?
Because MS announced that Xenos was a ~332M transistor part, but your article implies that Xenos is in reality made of 382M transistors.
And interesting article, thanks Dave.
I'd admit that I think the article is probably too "paper talk" for my taste, but without any silicon available to independent parties for testing, it would have been hard to have anything else but a high level description of the inherent strengths of the architecture.
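The figures in the quoted passage can be sanity-checked with a few lines. This is only the arithmetic from the quote (one transistor per DRAM bit, 232M for the parent die, a supposed 150M for the daughter die); the function name is hypothetical.

```python
def edram_cell_transistors(megabytes):
    """One transistor per bit, as the quoted passage assumes."""
    return megabytes * 1024 * 1024 * 8


PARENT_DIE_M = 232     # millions, the figure attributed to ATI in the quote
DAUGHTER_DIE_M = 150   # millions, the figure the quote treats as supposition

cells_m = edram_cell_transistors(10) / 1e6
print(cells_m)                          # roughly the "80M" in the quote
print(DAUGHTER_DIE_M - cells_m)         # what would be left for ROP/FSAA/control logic
print(PARENT_DIE_M + DAUGHTER_DIE_M)    # total implied by the article
```

Counting binary megabytes gives a little under 84M cell transistors, so the "80M memory, 70M logic" split in the quote is a round-number approximation.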
if there are some things we don't know because of nda i'm sure it's the same for him. So really he doesn't know how ati is going to handle this and nvidia is basically telling him they don't like unified shaders and that is what he is going with. If he had hands on experience with the xenos then I could believe him.
KK is not saying ATI can't address that problem, he's saying that with a unified shading mechanism there are certain problems, and in fact he's right.
We already exposed/talked about this many months ago, I remember talking about it at least 2 or 3 years ago in a thread about future technology.
KK is not saying Xenos is going to suck, even if he's obviously trying to do some damage control.
When KK can describe why pixels output by vertex shading would clog and stall the entire chip, then I'll give him some due. All he's doing is saying "I don't think it works."
So far I've not seen any explanation, anywhere, why a complete stall would occur.
Even a triangle that covers the entire screen can't stall Xenos, because the only buffering required is for the vertices of the triangle while the triangle is being rasterised.
He's talking bull and he knows it.
Jawed
The throughput of the system is such that ATI expect to be able to achieve two loops, two texture instructions and 6 ALU instructions per pixel, per cycle at Xenos's peak fill-rate.
Based on 8 pixels per clock output, I can see how to derive two texture instructions (capacity of 16 filtering texture processors) and 6 ALU instructions (capacity of 48 ALUs) but I wonder what it is that determines "two loops".
Any ideas?
Jawed
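The derivation of the texture and ALU figures can be written out as below. It only divides the unit counts quoted in the thread by the peak output rate; it does not explain the "two loops" figure, which remains the open question.

```python
PIXELS_PER_CLOCK = 8          # peak fill-rate quoted in the thread
FILTERED_TEXTURE_UNITS = 16   # filtering texture processors
ALUS = 48                     # 3 shader arrays of 16 ALUs


def per_pixel_per_clock(units, pixels_per_clock=PIXELS_PER_CLOCK):
    """Operations available to each output pixel at peak fill-rate."""
    return units / pixels_per_clock


print(per_pixel_per_clock(FILTERED_TEXTURE_UNITS))  # 2.0 texture instructions
print(per_pixel_per_clock(ALUS))                    # 6.0 ALU instructions
```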
So far I've not seen any explanation, anywhere, why a complete stall would occur.
Jawed
OMG, we already discussed that at least on a couple of different threads on the console's forum.
If stall couldn't occur I bet ATI would quote 100% efficiency, not just a ludicrous 95% number :)
No the efficiency is due to the 16-way SIMD architecture, which means that some ALUs will have no pixels to process (off-triangle) or some ALUs will process pixels needlessly in per-pixel dynamically branching code.
If Xenos was fully MIMD then there'd be no efficiency fall-off. The overhead in scheduling hardware, instruction decode and register incoherence would be huge though :shock:
Jawed
Oh I see, so It has a perfect scheduler, an oracle :P
The only thing that's non-deterministic is texturing. All other instruction types have entirely deterministic execution times.
While a pixel thread is waiting for texture filtering, say, there are other pixel threads that will be using the entire ALU capacity of Xenos.
Obviously if all your pixel shaders are extremely short then Xenos runs into its fill-rate limit of 8 pixels per clock.
Jawed
JF_Aidan_Pryde
14-Jun-2005, 10:50
Two things:
- NV reckons they have near full efficiency with their current architecture
- David Kirk believes in 'eventual unification' of the shader hardware
Shifty Geezer
14-Jun-2005, 11:13
Regards KK's comments, he questions how ATi could avoid stalls. He didn't have the benefit of Dave's article to explain the answer to him. We've got insight now that no-one had before. Would you lame-brains please stop trolling around with your insufferably invasive 'so and so's lying' nonsense and stick to talking about the points of the hardware!
Regards the 16 ALU arrays, before a switch between Vertex and Pixel work, the array will have to have finished its current load, right? Does each ALU process one shader program? Will there be a situation where, processing a pixel group, 15 ALU's finish their simple shader programs while the 16th ALU is working on a complicated shader, and that 16 ALU cluster will have to wait for that one ALU to finish before switching to Vertex work? But if so, those 15 ALU's can work on other pixel shaders until the complicated one is finished, right? But then the same situation might arise.
So what kind of break-period is needed for the shader groups to switch task? Is it a case of the scheduler deciding when to feed more data to pixel processing ALU's, and when to halt data to let them finish and then switch to Vertex work?
Agisthos
14-Jun-2005, 11:15
A quick question,
a lot of people were led to believe the bandwidth between the parent and daughter die is 256GB/s. But it now looks like it is only 32GB/s.
The 256GB/s is for between the
Logic Controllers/ROP's <----> Edram
and thus onchip.
Is this correct yes/no ?
The 256GB/s is for between the
Logic Controllers/ROP's <----> Edram
and thus onchip.
Is this correct yes/no ?
Yes, it is
Regards the 16 ALU arrays, before a switch between Vertex and Pixel work, the array will have to have finished its current load, right?
Yes. But the array can run a different thread each clock cycle. So there's no actual "switch" required. The ALUs run at least a 2-way interleaved threading, e.g. every even clock cycle an instruction from thread A is executed, every odd clock cycle an instruction from thread B is executed. The level of interleaving may be higher, we don't know...
Does each ALU process one shader program? Will there be a situation where, processing a pixel group, 15 ALU's finish their simple shader programs while the 16th ALU is working on a complicated shader, and that 16 ALU cluster will have to wait for that one ALU to finish before switching to Vertex work? But if so, those 15 ALU's can work on other pixel shaders until the complicated one is finished, right? But then the same situation might arise.
A shader array executes one thread across all 16 ALUs simultaneously. Due to per-pixel dynamic branching in the shader code it's quite possible that only 1 pixel out of the 16 being processed is actually affected by the shader code. The same applies at triangle boundaries, where one or more pixels in a "quad" at the triangle boundary is actually not part of the triangle - an ALU will be processing a shader on "thin air".
This could be the case for tens or hundreds of shader instructions.
So what kind of break-period is needed for the shader groups to switch task? Is it a case of the scheduler deciding when to feed more data to pixel processing ALU's, and when to halt data to let them finish and then switch to Vertex work?
There's no break period. The ALUs have an extremely short pipeline around them which means that the scheduler can react quickly to threads that need to come out of context (e.g. branch prediction failure). We don't know what the thread-interleave for the ALUs is, but it'll be balanced against the pipeline length.
It's arguable whether there is an ALU pipeline, Dave didn't really clarify this. Dave's also unable to quantify how many threads are in flight at one time.
All we know is that Xenos can hold a maximum of 4000 shader instructions concurrently (split arbitrarily across vertex and pixel shaders). There's plenty of room for speculation based on just that fact :)
Jawed
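A toy model of the lock-step point made above, not a description of ATI's hardware: a 16-wide array issues each instruction to all 16 lanes, and any lane whose pixel is off-triangle or did not take the branch is wasted for that instruction. The function names and the example instruction counts are invented for illustration.

```python
SIMD_WIDTH = 16  # lanes per shader array, as described in the thread


def lane_utilisation(active_lanes, width=SIMD_WIDTH):
    """Fraction of ALU slots doing useful work for one instruction."""
    if not 0 <= active_lanes <= width:
        raise ValueError("active_lanes must be between 0 and the SIMD width")
    return active_lanes / width


def thread_utilisation(active_per_instruction, width=SIMD_WIDTH):
    """Average utilisation over a thread, given active lanes per instruction."""
    counts = list(active_per_instruction)
    if not counts:
        raise ValueError("need at least one instruction")
    return sum(counts) / (len(counts) * width)


# 10 instructions where all 16 pixels are live, then a 100-instruction
# branch that only one pixel in the group actually takes.
example = [16] * 10 + [1] * 100
print(thread_utilisation(example))
```

In this made-up case most of the array's slots are spent on "thin air", which is the kind of loss a 16-wide SIMD design accepts in exchange for much simpler scheduling than a fully MIMD one.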
Shifty Geezer
14-Jun-2005, 12:09
If I've got this right then, all the ALU's are working on the same shader, not one ALU per shader (can a pixel contain multiple shaders for combined effects?). Rather than take 16 pixels and process them in parallel on 16 ALUs, each of the sixteen pixels is buzzed through the ALU array with multiple ALU's working on it if needed, so they all finish at the same time pretty much. And there's no context switch, but the next batch of data, vertex or pixel, is presented. This eliminates any wait state.
:?:
Dave Baumann
14-Jun-2005, 12:16
1.) Does a single group (as operated on by the SIMD ALUs) contain only 16 pixels/verts as opposed to the 64 indicated by the leak?
2.) Can you expand on the nature of the hierarchical buffers? E.g. what sort of attributes are tracked in hierarchical fashion (stencil parity, actual stencil value, z min, z max)?
1.) NDA.
2) Not beyond the content of the article unfortunately.
There are 48 ALUs, and there are 32 texture processors (16 filtering and 16 point sample), at least at face value.
It seems to me that at any one time only 16 texture processors can be operational as Xenos can process 64 threads, in total, simultaneously.
So, why is half the texturing capacity of Xenos idle at any one time? Are there actually 32 independent texture processors, or are they shared in some way between filtering and point sampling, meaning that only 16 can be operational at any one time?
So, is Xenos executing 64 threads concurrently (3 shader arrays of 16-way SIMD + 16 texture operations)? Or is it 80 threads (i.e. including the extra 16 texture operations)?
Jawed
First the 64 thread figure is not accurate. Second a single texture thread could encompass more than one texture. Third a “thread” is not a single operation but a group of operations of the same state (i.e. a single thread will occupy either 16 ALU’s for however many instructions it has, or the texture samplers, or the vertex fetch units).
So is this article the start of something new at Beyond3D? Or only if/when console graphics chip makers are willingly helpful? What about one on PS3?
As I've mentioned before the thrust of this article was really on the notion that this is an interesting approach to rendering in its targeted application and to see how it works and what capabilities it has and also because important aspects of this chip might give us clues as to what is to come. We will see about RSX when more is being said about it, but certainly from the public information so far it appears not to have the eDRAM / Tiling mechanism to look at.
Does MEMEXPORT kinda work like a 'phase' marker in the shader?
What was in the article was all the information I have at present.
Also, any information on Geometry Shaders? Do they exist in Xenos?
No, they are not there.
Does this conclusion take into account the transistors added for all the extra control logic in the C1. Also, while it is true that graphics chips in PCs are being used inefficiently, this shouldn't be the case with a closed platform such as a game console.
In truth I'm not yet convinced they are that substantial; at least I am fairly happy about my assessment of this architecture's application to the mobile space, and the benefits of a unified platform there (fewer ALU's, but in constant use) have to outweigh the drawbacks (extra control logic) in order for it to be worthwhile. Also, for the most part the inefficiencies we are talking about here are pretty much inherent in shader processing, not just the environment that they are being utilised in.
Could you make a new chart that compare the capabilities of Xenos, R420 and NV40?
More or less you have it with the inclusion of SM2.0 and SM3.0 in the chart.
Dave, is that part:
speculation or fact?
Because MS announced that Xenos was a ~332M transistor part, but your article imply that Xenos is in reality made of 382M transistors.
The 232M figure for the parent die comes directly from ATI, the 150M for the daughter die was supposition given the initial talk of 150M at the release time - I wasn't aware that MS had said anything differently since then. I'll see if I can get further clarification.
All other instruction types have entirely deterministic execution times.
Yes, this is absolutely the case and its with this knowledge in mind that the entirety of the threading operation works on Xenos.
- NV reckons they have near full efficiency with their current architecture
- David Kirk believes in 'eventual unification' of the shader hardware
And do you think those two statements sit happily together? (i.e. If the first is really the case, in the context being discussed here, what is the point of doing the second?)
Regards the 16 ALU arrays, before a switch between Vertex and Pixel work, the array will have to have finished its current load, right? Does each ALU process one shader program? Will there be a situation where, processing a pixel group, 15 ALU's finish their simple shader programs while the 16th ALU is working on a complicated shader, and that 16 ALU cluster will have to wait for that one ALU to finish before switching to Vertex work? But if so, those 15 ALU's can work on other pixel shaders until the complicated one is finished, right? But then the same situation might arise.
So what kind of break-period is needed for the shader groups to switch task? Is it a case of the scheduler deciding when to feed more data to pixel processing ALU's, and when to halt data to let them finish and then switch to Vertex work?
The ALU's are always working on groups of data (vertices or pixels) of the same state; should there be a thread with just one vertex or pixel then all but one of the processing slots in that thread will be wasted. However, there is no latency between one thread finishing and the next starting.
Simon F
14-Jun-2005, 12:24
So, my favourite things:
1. Alpha to Mask - means there's no need for supersampling to get anti-aliased alpha-blended textures - I think!
Better known as "screen-door translucency", but this would then seem to limit you to only N (=4?) levels of alpha.
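A minimal sketch of why alpha-to-mask quantises alpha, assuming 4 samples per pixel, no dithering, and a simple "fill the lowest bits first" mask. How Xenos actually chooses which samples to cover is not stated in the thread, so `coverage_mask` is purely illustrative.

```python
SAMPLES = 4  # 4x multisampling assumed


def coverage_mask(alpha, samples=SAMPLES):
    """Turn an alpha value in [0, 1] into a per-sample coverage bitmask."""
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * samples + 0.5)   # round to nearest sample count
    return (1 << covered) - 1              # lowest `covered` bits set


# Every alpha value collapses onto one of a handful of masks.
masks = sorted({coverage_mask(a / 100) for a in range(101)})
print([format(m, "04b") for m in masks])
```

Under these assumptions there are only five distinct results (no samples through all four), which is the coarse translucency being pointed out above.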
Simon F
14-Jun-2005, 12:42
Each one of the ALU's is a complete instruction duplicate of the others and are all single precision IEEE floating point 32-bit compliant. The ALU's will process everything in FP32 internal precision
I presume "compliant" here probably means no denormalised numbers or NaNs, and simplified rounding. Is there any other info?
If I've got this right then, all the ALU's are working on the same shader, not one ALU per shader (can a pixel contain multiple shaders for combined effects?). Rather than take 16 pixels and process them in parallel on 16 ALUs, each of the sixteen pixels is buzzed through the ALU array with multiple ALU's working on it if needed, so they all finish at the same time pretty much. And there's no context switch, but the next batch of data, vertex or pixel, is presented. This eliminates any wait state.
:?:
I think it's fair to say that the shader array operates on 16 pixels concurrently, all running a single shader. This is effectively a "super-thread", consisting of 16 pixel threads. All 16 pixels are forced to run the same instruction in lock-step, for the duration of the shader.
In theory there's going to be pixel-group fragmentation, which is what I'm presuming the Grouper is there for. As I understand it, fragmentation would arise when a high-level dynamic branch in the shader code causes some pixels to omit execution of a called shader - e.g. when pixels are found to be in shadow. e.g., out of 100 pixels for the current triangle, all 100 pixels might execute a translucency shader, but 30 of them might also execute a shadow shader. That would create shader state fragmentation.
---
Because the ALUs can run different super-threads on successive clock cycles, it's quite possible on clock cycle 1 pixels 1-16 are being processed, and on clock cycle 2 pixels 17-32 are being processed, and then flip-flopping like this until the shader has completed.
We don't know what the interleave is.
All 32 pixels in this example would be running the same shader. They don't have to be, though. Scheduling is extremely flexible...
Also, bear in mind that every so often, the scheduler will decide to execute a vertex shader on 16 vertices, so there's nothing to suggest that all the pixels in a triangle are processed in a single "spurt" of execution, with contiguous shader scheduling. e.g. if five separate shaders are required to shade 100 pixels on a triangle, the fully shaded pixels may appear in dribs and drabs of 16-pixels at a time.
Jawed
First the 64 thread figure is not accurate.
Cool, that's the best answer!
Second a single texture thread could encompass more than one texture.
Is that solely multi-texturing or could that also encompass successive texture operations (either mutually dependent or independent)?
Jawed
AlStrong
14-Jun-2005, 14:26
- NV reckons they have near full efficiency with their current architecture
- David Kirk believes in 'eventual unification' of the shader hardware
And do you think those two statements sit happily together? (i.e. If the first is realy the case, in the context being discussed here, what is the point of doing the second?)
My guess? Flexibility for the programmer/designer/artists for the scenes they are making in conjunction with performance, what is available for use and what won't be used.
Hm, I always thought ROP stands for raster operation/operator. And PixelFX was a bit more than that, though not much.
Also note that data can be written to the eDRAM at the same time as it is being cleared from the previous data that resided there, meaning there should be little to no wait when removing the previous data from the eDRAM
How's this supposed to work? Resolve on demand? Meaning that if a new pixel is written to a position that has not yet been exported to main memory, that pixel will be resolved and the new pixel data is written to the eDRAM instead? But then you wouldn't know which pixels are from the new frame and which are still from the old one.
Or does the first tile of a new frame only use the Z-buffer space of the last tile of the previous frame for rendering? Double buffering, basically.
After the Z only rendering pass the Hierarchical Z Buffer is fully populated for the entire screen which results in the render order not being an issue.
But that is true only for opaque geometry. How is transparent geometry handled?
Where we often consider traditional pixel pipelines to be operating on pixel quads in individual triangles in a pipeline, this is not the case with Xenos
What does "often" mean in this context? Current ATI PC architectures?
darkblu
14-Jun-2005, 15:10
ok, portions of my brain still need waking up but re that fp10, how can it represent values up to 32 with 3 bits of exponent (i assume an exponent bias of 4)?
Megadrive1988
14-Jun-2005, 15:31
question(s):
with the combination of new features in-hardware that Xenos has, it almost sounded like Xenos has a programmable primitive processor (PPP). does it?
if not, what does Xenos have, if anything, that comes close to the features of a PPP?
what would a PPP offer that Xenos does not have, assuming Xenos lacks a PPP?
Simon F
14-Jun-2005, 15:48
why would it need virtual mem if it has access to the entire 512MB? :P
A CPU has access to all of its memory yet has virtual memory.....
ok, portions of my brain still need waking up but re that fp10, how can it represent values up to 32 with 3 bits of exponent (i assume an exponent bias of 4)?
1 sign bit, 6 mantissa bits, 3 exponent bits. Implicit one, exponent bias of -3, denorm support.
So 0'111111'111 represents the number +1.111111b * 2^(7-3) = +31.75.
Denorm support means the smallest representable number above zero is 0'000001'000, representing 0.000001b * 2^(-2) = 1/256.
Hm, for some reason I find it more intuitive to write the bias as the value that needs to be added to the stored exponent to get the real one. Instead of the other way round, like IEEE754 does :?
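The layout described in this post can be checked with a few lines of Python. This is only a sketch of the format as stated here (1 sign bit, 3 exponent bits, 6 mantissa bits, implicit one, real exponent = stored exponent - 3, denorms at stored exponent 0); the name decode_fp10 and its argument order are made up for illustration and are not from ATI.

```python
def decode_fp10(sign, exponent, mantissa):
    """Decode the fp10 layout as described in the post (hypothetical helper)."""
    assert sign in (0, 1) and 0 <= exponent < 8 and 0 <= mantissa < 64
    if exponent == 0:
        # Denormal: no implicit one, exponent pinned at -2.
        value = (mantissa / 64.0) * 2.0 ** -2
    else:
        # Normal: implicit leading one, real exponent = stored exponent - 3.
        value = (1.0 + mantissa / 64.0) * 2.0 ** (exponent - 3)
    return -value if sign else value

largest = decode_fp10(0, 0b111, 0b111111)   # 1.111111b * 2^4
smallest = decode_fp10(0, 0b000, 0b000001)  # 0.000001b * 2^-2
print(largest, smallest)
```

Under those assumptions the two results match the 31.75 and 1/256 worked out by hand above.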
dukmahsik
14-Jun-2005, 16:06
does anyone know the rsx transistor count compared to the new 382ish million transistor count of the xenos?
krychek
14-Jun-2005, 16:06
question(s):
with the combination of new features in-hardware that Xenos has, it almost sounded like Xenos has a programmable primitive processor (PPP). does it?
if not, what does Xenos have, if anything, that comes close to the features of a PPP?
what would a PPP offer that Xenos does not have, assuming Xenos lacks a PPP?
From the Arstechnica article, it looked as if all the PPP stuff, "Procedural Synthesis" will be handled by a thread on the CPU. The Xenon level 2 cache has some special feature just for this purpose. A thread can use a small part of the L2 cache in a special way when it's put in the "write streaming mode". Data generated by the thread is accessible to the GPU from the L2 cache itself - it doesn't go to the main memory. Besides, if there was a PPP in Xenos, they would definitely advertise it IMO :D
Simon F
14-Jun-2005, 16:19
Hm, I always thought ROP stands for raster operation/operator.
Yes, same here. Then again, I always thought "temporal antialiasing" meant... :P
Megadrive1988
14-Jun-2005, 16:33
does anyone know the rsx transistor count compared to the new 382ish million transistor count of the xenos?
about 300 million transistors will be in RSX. Nvidia says over 300 million transistors, which translates into: *slightly* over 300 million. or otherwise if it was significantly more than 300M, Nvidia would be singing about it.
darkblu
14-Jun-2005, 16:41
ok, portions of my brain still need waking up but re that fp10, how can it represent values up to 32 with 3 bits of exponent (i assume an exponent bias of 4)?
1 sign bit, 6 mantissa bits, 3 exponent bits. Implicit one, exponent bias of -3, denorm support.
so the bias is 3, not 4. that would mean a min exponent of -2, which would provide the whopping min positive value of .25 :evil:
ed: sorry, completely overslept the denorm support part.
Colourless
14-Jun-2005, 17:07
This post http://www.beyond3d.com/forum/viewtopic.php?p=539428#539428 by Thowllly indicates one possibility for how it would work.
I'm going to guess that it's more like this (where s is the sign bit):
s-111-xxxxxx = 1xxxxxx000000 (16.0 - 32.0)
s-110-xxxxxx = 01xxxxxx00000 (8.0 - 16.0)
s-101-xxxxxx = 001xxxxxx0000 (4.0 - 8.0)
s-100-xxxxxx = 0001xxxxxx000 (2.0 - 4.0)
s-011-xxxxxx = 00001xxxxxx00 (1.0 - 2.0)
s-010-xxxxxx = 000001xxxxxx0 (0.50 - 1.0)
s-001-xxxxxx = 0000001xxxxxx (0.25 - 0.50)
s-000-xxxxxx = 0000000xxxxxx (0.00 - 0.25)
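The table above reads as a 13-bit fixed-point expansion whose least significant bit is worth 1/256. A minimal sketch of that reading, assuming the stored exponent simply shifts a 7-bit "1xxxxxx" pattern left (the helper name fp10_to_fixed13 is hypothetical):

```python
def fp10_to_fixed13(exponent, mantissa):
    """13-bit fixed-point magnitude in units of 1/256 (hypothetical helper)."""
    assert 0 <= exponent < 8 and 0 <= mantissa < 64
    if exponent == 0:
        return mantissa                      # 0000000xxxxxx, no leading one
    # Prepend the implicit one, then shift by (exponent - 1) places.
    return (0b1000000 | mantissa) << (exponent - 1)

# Print each row of the table with the smallest and largest value it can hold.
for e in range(7, -1, -1):
    lo = fp10_to_fixed13(e, 0)
    hi = fp10_to_fixed13(e, 63)
    print(f"s-{e:03b}-xxxxxx  {lo:013b}  {lo / 256.0} .. {hi / 256.0}")
```

Each printed row should line up with the ranges listed above, with the upper bound of each range being exclusive.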
Does anyone know what project any of these guys worked on before the C1?
Robert Feldstein
Joe Cox, Director
Clay Taylor
Mark Fowler
however ATI are keen to point out that while there may be apparent similarities the designs are entirely independent as there are distinct virtual and physical barriers between the groups working on the various console developments, past and present, and no members of the Flipper architecture team were involved in Xenos's development
Since it would have been difficult for anyone to foresee ArtX's acquisition by ATI and ATI's subsequent contracts to design GPUs for both Nintendo and MS, I would think that the design of the Flipper chip would have been disseminated widely throughout ATI. After all I assume that ATI was looking at ArtX as an asset beyond its contract with Nintendo. The IP and personnel would have been the real value especially considering that there was no guarantee that they would produce the followup to the flipper.
darkblu
14-Jun-2005, 18:04
I'm going to guess that it's more like this (where s is the sign bit):
s-111-xxxxxx = 1xxxxxx000000 (16.0 - 32.0)
s-110-xxxxxx = 01xxxxxx00000 (8.0 - 16.0)
s-101-xxxxxx = 001xxxxxx0000 (4.0 - 8.0)
s-100-xxxxxx = 0001xxxxxx000 (2.0 - 4.0)
s-011-xxxxxx = 00001xxxxxx00 (1.0 - 2.0)
s-010-xxxxxx = 000001xxxxxx0 (0.50 - 1.0)
s-001-xxxxxx = 0000001xxxxxx (0.25 - 0.50)
s-000-xxxxxx = 0000000xxxxxx (0.00 - 0.25)
yes, my concern from my last post was particularly re that last line - as you don't get an exponent of 0 (that would be reserved for representing denormalized values) you end up with a minimal biased exponent of -2, hence a (normalized) minimal positive of .25 . all that given that i missed the denormalization support part, which, of course, kinda saves the day.
Titanio
14-Jun-2005, 18:06
does anyone know the rsx transistor count compared to the new 382ish million transistor count of the xenos?
Dave's 150m figure for the daughter die is a guess. It would be 382m if it were true, but we don't know yet. It's certainly 3xx, of course. However beyond 232m transistors, most of the transistors are taken up by eDram memory, and thus any comparison with non-eDram chips should be qualified.
OICAspork
14-Jun-2005, 18:09
The IP and personnel would have been the real value especially considering that there was no guarantee that they would produce the followup to the flipper.
I'm pretty sure ATI made an agreement to produce Flipper's successor before reaching an agreement with Microsoft. I personally wonder how long Hollywood has been in development. If development started immediately after the deal with Nintendo it is likely that Hollywood has been in development for a year longer than C1 and won't be shipping until a year after it. I'm really curious about how it will stack up to the competition.
The IP and personnel would have been the real value especially considering that there was no guarantee that they would produce the followup to the flipper.
I'm pretty sure ATI made an agreement to produce Flipper's successor before reaching an agreement with Microsoft. I personally wonder how long Hollywood has been in development. If development started immediately after the deal with Nintendo it is likely that Hollywood has been in development for a year longer than C1 and won't be shipping until a year after it. I'm really curious about how it will stack up to the competition.
My point (guess) is that ATI bought ArtX with the intention of using its IP wherever it saw fit and since it was acquired before the MS deal information WRT its design would have been available to any number of people in the company. If so, ATI's comment on the barriers between development groups, while true, does not preclude Flipper details from having been disseminated prior to there being a plurality of console development groups.
Headstone
14-Jun-2005, 18:43
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
dukmahsik
14-Jun-2005, 18:48
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
i doubt those claims as nintendo already has said they are not supporting HD for next generation. no need for 12mb edram then.
darkblu
14-Jun-2005, 19:06
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
i doubt those claims as nintendo already has said they are not supporting HD for next generation. no need for 12mb edram then.
you say that from personal experience as a graphics programmer or?
dukmahsik
14-Jun-2005, 19:18
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
i doubt those claims as nintendo already has said they are not supporting HD for next generation. no need for 12mb edram then.
you say that from personal experience as a graphics programmer or?
come, you don't have to be a programmer to use logic :lol:
I am stating that because what else would the edram be used for? I am keeping in mind that ATI is designing both gpus (possibly from the same design team). if edram is included then in all likelihood the features wouldn't be too different from the xenos's edram, would they? if ninty said themselves they have no desire for HD at this time then why the inclusion of an expensive edram?
AlStrong
14-Jun-2005, 19:26
you don't need to go HD if you want to use eDRAM though.
you can fit in 640x480 + 4xMSAA :wink:
If the non-HD thing is true in the end, then devs can push some really nice effects at a high frame rate, and they won't necessarily need as crazy a graphics chip to compete on that end.
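A rough footprint check for the 640x480 + 4xMSAA case. This is a back-of-envelope sketch, assuming 32-bit colour plus 32-bit Z/stencil per sample and treating "MB" as MiB; the per-sample sizes and the function name are assumptions, not figures from ATI or Nintendo.

```python
MIB = 1024 * 1024

def framebuffer_bytes(width, height, samples, bytes_colour=4, bytes_depth=4):
    """Colour + depth storage for a multisampled render target (assumed sizes)."""
    return width * height * samples * (bytes_colour + bytes_depth)

sd_4x = framebuffer_bytes(640, 480, 4)
print(sd_4x / MIB)        # footprint in MiB
print(sd_4x <= 10 * MIB)  # fits in Xenos-sized eDRAM?
print(sd_4x <= 12 * MIB)  # fits in the rumoured 12MB?
```

With those assumptions the buffer comes to a little under 10 MiB, so it would fit in either size with a small amount left over.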
darkblu
14-Jun-2005, 19:31
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
i doubt those claims as nintendo already has said they are not supporting HD for next generation. no need for 12mb edram then.
you say that from personal experience as a graphics programmer or?
I am stating that because what else would the edram be used for?
1) fat FSAA with 64/128bit HDR
2) MRTs with FSAA and HDR
3) texture render targets that do not necessarily get flushed to main mem
4) local caches for virtual-texturing and post-tessellator vertex buffers
5) anything else that i did not think of within those 5 secs that it took me to write the above four.
I am keeping in mind that ATI is designing both gpus (possibly from the same design team).
possibly not.
if edram is included then in all likelihood the features wouldn't be too different from the xenos's edram, would they?
xenos' edram is not exactly abundant, even for its own purposes.
come, you don't have to be a programmer to use logic
logic serves you not if you don't have a clue about the matter.
aaronspink
14-Jun-2005, 22:03
Regards KK's comments, he questions how ATi could avoid stalls. He didn't have the benefit of Dave's article to explain the answer to him. We've got insight now that no-one had before. Would you lame-brains please stop trolling around with your insufferably invasive 'so and so's lying' nonsense and stick to talking about the points of the hardware!
KK is just doing his job of marketing but don't make it sound like he was being honest and upfront. FIFO queuing and scheduling of linear logic pipelines is hardly new or groundbreaking. People have been doing it in systems with much greater uncertainties and much higher complexities since the early 80's.
Regards the 16 ALU arrays, before a switch between Vertex and Pixel work, the array will have to have finished its current load, right? Does each ALU process one shader program? Will there be a situation where processing a pixel group, 15 ALU's finish their simple shader programs while the 16th ALU is working on a complicated shader, and that 16 ALU cluster will have to wait for that one ALU to finish before switching to Vertex work? But if so, those 15 ALU's can work on other pixel shaders until the complicated one is finished, right? But then the same situation might arise.
A better way to think of the SIMD array is as a barrel processor. A barrel processor (Tera is one example) processes an instruction from a different thread each cycle.
So what kind of break-period is needed for the shader groups to switch task? Is it a case of the scheduler deciding when to feed more data to pixel processing ALU's, and when to halt data to let them finish and then switch to Vertex work?
Assuming ATI has heard of pipelining, the likely answer is 0 cycles of break period.
Aaron Spink
speaking for myself inc.
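To make the barrel processor analogy concrete, here is a toy scheduler that issues one instruction per cycle from a different ready thread each time. It is a sketch of the general idea only, not of the actual Xenos sequencer; the thread names and instruction counts are invented.

```python
from collections import deque

def barrel_schedule(threads, cycles):
    """Issue one instruction per cycle, rotating across ready threads (toy model)."""
    queue = deque(threads.items())      # (thread name, instructions remaining)
    trace = []
    for _ in range(cycles):
        if not queue:
            trace.append(None)          # nothing ready: the unit idles this cycle
            continue
        name, remaining = queue.popleft()
        trace.append(name)
        if remaining > 1:
            queue.append((name, remaining - 1))  # back of the line for its next turn
    return trace

trace = barrel_schedule({"vertex0": 2, "pixel0": 3, "pixel1": 1}, 6)
print(trace)
```

In this model vertex and pixel threads interleave freely and no cycle goes unused as long as some thread is ready, which is the "0 cycles of break period" point.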
dukmahsik
14-Jun-2005, 22:07
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
i doubt those claims as nintendo already has said they are not supporting HD for next generation. no need for 12mb edram then.
you say that from personal experience as a graphics programmer or?
I am stating that because what else would the edram be used for?
1) fat FSAA with 64/128bit HDR
2) MRTs with FSAA and HDR
3) texture render targets that do not necessarily get flushed to main mem
4) local caches for virtual-texturing and post-tessellator vertex buffers
5) anything else that i did not think of within those 5 secs that it took me to write the above four.
I am keeping in mind that ATI is designing both gpus (possibly from the same design team).
possibly not.
if edram is included then in all likelihood the features wouldn't be too different from the xenos's edram, would they?
xenos' edram is not exactly abundant, even for its own purposes.
come, you don't have to be a programmer to use logic
logic serves you not if you don't have a clue about the matter.
nice really nice, what do you have to say about nintendo wanting to keep costs at an absolute minimum much like the GC compared to the other consoles? would they include a costly feature such as 12mb edram? if they do, I am all for it as I like nintendo and their games. but I don't see it happening still for reasons I have listed while your opinion is otherwise.
aaronspink
14-Jun-2005, 22:10
Each one of the ALU's is a complete instruction duplicate of the others and are all single precision IEEE floating point 32-bit compliant. The ALU's will process everything in FP32 internal precision
I presume "compliant" here probably means no denormalised numbers or NaNs, and simplified rounding. Is there any other info?
Likely, the only architecture that ever fully supported IEEE floating point was x87 which is more because they were designed/based on the same concepts vs x87 or IEEE floating point being the "correct" way to do it.
There should be little point in NaNs in graphics, likewise denormalised numbers. And most likely round to nearest as that in general is the most used rounding method.
aaron spink
speaking for myself inc.
SanGreal
14-Jun-2005, 22:13
The latest rumor hot off the press:
http://www.gamesindustry.biz/content_page.php?aid=9485
Those specs are just a slightly modified version of the previously discussed specs (http://www.beyond3d.com/forum/viewtopic.php?t=23924).
The specs originate from a blog and the original list is not very credible, imo.
ralexand
14-Jun-2005, 22:35
A better way to think of the SIMD array is as a barrel processor. A barrel processor (Tera is one example) processes an instruction from a different thread each cycle.
Cool! I dig that analogy. Thanks.
aaronspink - totally OT, but do you know why x87 allows rounding precision to be set separately from the actual precision of the datatype being operated on?
darkblu
15-Jun-2005, 00:02
aaronspink - totally OT, but do you know why x87 allows rounding precision to be set separately from the actual precision of the datatype being operated on?
there's no such thing as the actual precision of the datatype. there's the fpu stack/register fixed width (80bits plus some additional LSBits). on top of that you specify in what part of those bits you are interested in - namely, the precision.
darkblu
15-Jun-2005, 00:10
nice really nice, what do you have to say about nintendo wanting to keep costs at an absolute minimum much like the GC compared to the other consoles?
cost is always a trade-off. ninty may or may not include 12MB of edram, it is a question of market viability (from their point of view, that is). the original question was about application utilization of that memory - an entirely different matter, and what i tried to show you is that 12MB of fast, local video mem and HD support are not necessarily coupled in the way that article tried to imply. what you do with that is up to you.
would they include a costly feature such as 12mb edram? if they do, I am all for it as I like nintendo and their games. but I don't see it happening still for reasons I have listed while your opinion is otherwise.
again, i am not stating revolution will have 12MB of edram. what i am saying is that that amount of memory could easily find use in standard-def applications.
darkblu - you mean x87 doesn't specify operand precision in the instructions?
(i have never done any x87 assembly programming nor do i ever want to).
dukmahsik
15-Jun-2005, 00:44
nice really nice, what do you have to say about nintendo wanting to keep costs at an absolute minimum much like the GC compared to the other consoles?
cost is always a trade-off. ninty may or may not include 12MB of edram, it is a question of market viability (from their point of view, that is). the original question was about application utilization of that memory - an entirely different matter, and what i tried to show you is that 12MB of fast, local video mem and HD support are not necessarily coupled in the way that article tried to imply. what you do with that is up to you.
would they include a costly feature such as 12mb edram? if they do, I am all for it as I like nintendo and their games. but I don't see it happening still for reasons I have listed while your opinion is otherwise.
again, i am not stating revolution will have 12MB of edram. what i am saying is that that amount of memory could easily find use in standard-def applications.
okay I understand where you are coming from. I guess I am inclined to compare it to the xenos' edram where it does provide HD effects at almost no perf costs. Being that the design is also coming from ATI also leads me to think so. 8)
Mintmaster
15-Jun-2005, 00:55
First of all, Dave, just want to thank you for the excellent article! Looks like you did a lot of digging to get this information.
As for this:
Dave, Ken Kutaragi has a question :P
The vertex shader and pixel shader are unified in ATI's architecture, and it looks good at one glance, but I think it will have some difficulties. For example, some question where will the results from the vertex processing be placed, and how will it be sent to the shader for pixel processing. If one point gets clogged, everything is going to get stalled. Reality is different from what's painted on canvas. If we're taking a realistic look at efficiency, I think Nvidia's approach is superior.
GameSpot (http://www.gamespot.com/news/2005/06/13/news_6127392.html)
And great job! Congratulations!
That explanation is exactly why ATI's solution is superior, not NVidia's. All architectures have a FIFO between the vertex and pixel pipes to buffer out fluctuations. In a traditional architecture, if that FIFO gets full, the vertex shader unit sits around. If it's empty, the pixel shader unit sits around.
You can always shuffle around processing power to make sure nothing does get clogged like it does in a traditional architecture. You can probably use a smaller fifo as well for this reason.
One more point: Some people were talking in other threads about making a game balanced (esp. for a console) so that a unified architecture is pointless. The thing is that loads change throughout the drawing of an object, let alone a whole game. When an object rotates and moves, you get clusters of triangles with only a few pixels each (e.g. when far away or on a steep angle), and clusters with the opposite.
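The load-balancing argument can be put into numbers with a deliberately crude model: each batch carries some vertex work and some pixel work, a fixed split can only use its own units for each kind, and a unified pool can use everything for either. The unit counts (8 vertex + 16 pixel versus 24 unified) and the workloads are invented for illustration, and the model ignores the FIFO, which would smooth out short fluctuations but not sustained ones.

```python
def fixed_split_time(vertex_work, pixel_work, vertex_units, pixel_units):
    # Each stage can only use its own units, so the slower stage sets the pace.
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def unified_time(vertex_work, pixel_work, total_units):
    # Any unit can take either kind of work.
    return (vertex_work + pixel_work) / total_units

# (vertex work, pixel work) per batch, arbitrary units: balanced, pixel-heavy, vertex-heavy.
batches = [(800, 1600), (100, 2300), (1500, 900)]

fixed = sum(fixed_split_time(v, p, 8, 16) for v, p in batches)
unified = sum(unified_time(v, p, 24) for v, p in batches)
print(fixed, unified)
```

The two designs tie on the perfectly balanced batch and the fixed split falls behind whenever the ratio drifts, which is the situation described for objects that move nearer, further, or edge-on.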
ralexand
15-Jun-2005, 01:48
One more point: Some people were talking in other threads about making a game balanced (esp. for a console) so that a unified architecture is pointless. The thing is that loads change throughout the drawing of an object, let alone a whole game. When an object rotates and moves, you get clusters of triangles with only a few pixels each (e.g. when far away or on a steep angle), and clusters with the opposite.
Yes, that's an important point. You can't just say okay I'm going to optimise my app so that vertex to pixel ratio matches the hardware. A lot will still depend on what's in view and how close those objects are and that's a very dynamic thing.
Mintmaster - I agree that a unified approach is better. The Xenos embodiment seems to be extremely cool.
However; assuming the NV40 vertex shaders are fully MIMD and Xenos does not regroup vertex batches on the fly, vertex programs with lots of dynamic branches could still end up faster on a traditional architecture.
So I was thinking that maybe it does make sense to provide different types of processing units (say one set of units which are tailored for branchy more general purpose code and another set for streaming computation).
Ideally you wouldn't assign work to these two types of units based on a pixel/vertex distinction. Instead, work would be assigned based on the type of program being run...
what do you think?
Reverend
15-Jun-2005, 02:34
Does this article (or anywhere else on the web for the matter) say that Xenos is basically "done", or is there still time for "changes"? X360 is scheduled for this Christmas, right?
Megadrive1988
15-Jun-2005, 02:38
at this point in time, Xenos has to be completely done. as far as design, architecture, features. it was probably done many months ago. late 2004 to early 2005. there's only a few months before Xbox 360 has to be in stores. manufacturing has to begin probably anywhere from late July to early September (im betting on August) the only thing that could change is the core clockspeed and memory clock speed. well that's my guess anyway.
xbox 360 program http://www.gamerzforce.com/360.zip
how much longer until xbox 360 should be coming out, program is quite useless though.
darkblu
15-Jun-2005, 03:43
darkblu - you mean x87 doesn't specify operand precision in the instructions?
(i have never done any x87 assembly programming nor do i ever want to).
unless it's a loading/storing instruction - no. ops precision is controlled through the MCW (machine control word), and once set it is in effect until changed. the fpu registers, though, are of fixed width (80bits), they just keep a different "number of significant bits", so to say, depending on that MCW precision control state.
ed: the various x86 fp simd extensions are a quite different story, though. historically they've all originated as unconditionally-fp32-based, except for some specific instructions which by default would output less precision but would allow extra "refinement" iterations to produce higher/full precision.
so basically... if I want to enforce a particular precision inside a library, I have to push/pop the value of this fpu control word on every library entry point (or will the compiler/linker handle this for me)? Alternatively, can modern compilers just completely punt on x87 and exclusively generate SSE[1-3] code instead?
cheers,
Serge
edit: specifically, I need to be in a situation where precision is IEEE double and rounding mode is IEEE unbiased, down to the results of +/-,* ops in registers. Can SSEx cater for this?
Many thanks.
darkblu
15-Jun-2005, 04:15
okay I understand where you are coming from. I guess I am inclined to compare it to the xenos' edram where it does provide HD effects at almost no perf costs. Being that the design is also coming from ATI also leads me to think so. 8)
well, whatever their hollywood team end up with, i do hope we get a sane configuration of local video memory (tm). if we want to speculate (who, me? - never!) and take the flipper as a basis for extrapolation, they could (eventually, under good weather conditions) provide a tad more of embedded mem than the bare minimum to fit the vanilla set of framebuffers.
darkblu
15-Jun-2005, 04:26
so basically... if I want to enforce a particular precision inside a library, I have to push/pop the value of this fpu control word on every library entry point (or will the compiler/linker handle this for me)? Alternatively, can modern compilers just completely punt on x87 and exclusively generate SSE[1-3] code instead?
cheers,
Serge
edit: specifically, I need to be in a situation where precision is IEEE double and rounding mode is IEEE unbiased, down to the results of +/-,* ops in registers. Can SSEx cater for this?
Many thanks.
the default fp precision of all x86 c/c++ compilers i've come across is double (could be in the ansi standard just as well, i don't remember). the default fp round mode.. should be nearest, IIRC. and no, unless you have a purebred vectorizing compiler (say, intel's) where you use special data types to hint the compiler, you won't get any automagic compilation of user's code to simd; the best you can expect from the rest of the compilers who claim simd support is some optimised intrinsics.
hmm, come to think of it now, i believe the next gcc (4) was said to feature some cool auto-vectorisation capabilities..
My head hurts.
You're sui generis, Wavey. I can't imagine anyone else on the scene who would have/could have written this. (Some that "could", perhaps, but they wouldn't).
Re final conclusions, I would have thot that as efficiency goes up so does peak sustained power/heat for the same size chip (i.e. more parts cranking at once for sustained periods), tho if you also get a smaller chip for the same performance perhaps it all works out in your favor at the end.
Thanks for the excellent article, Dave: spot-on organization and writing, as well as an English tonne of information. I'm not even a game dev and I found it fascinating. Cheers for your hard work!
Relax, no editing pointers from me this time. I think everyone got to them first. =)
IgnorancePersonified
15-Jun-2005, 06:04
Great read!
More questions than answers in this thread. :) So I will add mine.
Are all those diagrams from ATI, MS or B3D?
This sounds like a "super" Northbridge and more for a traditional Intel like system. Is this the way forward for a PC or is an integrated On cpu memory controller the future?
Simon F
15-Jun-2005, 09:09
Likely, the only architecture that ever fully supported IEEE floating point was x87 which is more because they were designed/based on the same concepts vs x87 or IEEE floating point being the "correct" way to do it.
I would think that all the modern CPUs support IEEE correctly. I seem to recall that some (possibly even some versions of x86) may have done denorms with traps, but that's a small issue.
There should be little point in NaNs in graphics, likewise denormalised numbers.
The reason I was asking was for support of GPGPU style coding, eg porting CPU-based code onto the graphics system for faster processing.
And most likely round to nearest as that in general is the most used rounding method.
I thought "round to nearest even" was what IEEE defaults to. Either way, supporting those rounding modes does make things more expensive in HW.
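For reference, "round to nearest even" is round-to-nearest with exact ties broken towards the value whose last mantissa bit is zero. A small sketch using Python floats, assuming they are IEEE 754 doubles in the default rounding mode (true on mainstream platforms); this says nothing about what Xenos itself does.

```python
# Above 2**53 a double can only hold every second integer, so odd integers
# fall exactly halfway between two representable values.
a = float(2**53 + 1)   # tie between 2**53 and 2**53 + 2
b = float(2**53 + 3)   # tie between 2**53 + 2 and 2**53 + 4

# Ties go to the neighbour with an even mantissa, not always up or always down.
print(int(a) - 2**53, int(b) - 2**53)
```

The first tie rounds down and the second rounds up, which is why the mode is called unbiased: the errors from ties tend to cancel rather than accumulate in one direction.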
Simon F
15-Jun-2005, 09:11
darkblu - you mean x87 doesn't specify operand precision in the instructions?
(i have never done any x87 assembly programming nor do i ever want to).
The x87 FPU is evil.
snakejoe
15-Jun-2005, 09:26
Regards the 16 ALU arrays, before a switch between Vertex and Pixel work, the array will have to have finished its current load, right? Does each ALU process one shader program? Will there be a situation where processing a pixel group, 15 ALU's finish their simple shader programs while the 16th ALU is working on a complicated shader, and that 16 ALU cluster will have to wait for that one ALU to finish before switching to Vertex work? But if so, those 15 ALU's can work on other pixel shaders until the complicated one is finished, right? But then the same situation might arise.
So what kind of break-period is needed for the shader groups to switch task? Is it a case of the scheduler deciding when to feed more data to pixel processing ALU's, and when to halt data to let them finish and then switch to Vertex work?
The ALU's are always working on groups of data (vertices or pixels) of the same state; should there be a thread with just one vertex or pixel then all but one of the processing slots in that thread will be wasted. However, there is no latency between one thread finishing and the next starting.
Does that mean one ALU array (all 16 ALUs) can only work on either vertices or pixels at one moment?
Can this array(bank) work on both vertices and pixels at the same time? like 4 of them work on vertices others work on pixels?
krychek
15-Jun-2005, 09:59
Very interesting architecture!
I have a question:
To simplify things, I am only considering a single ALU.
As I understand, the nv40 can also multithread an ALU, however all the threads must have the same "state" or must be at the same instruction.
Whereas the Xenos does not have this restriction and each new cycle it can insert a new thread which can have a completely different instruction to be executed. Does this mean that there is dedicated pipelined logic inside each ALU for EACH kind of instruction? What else might be required to support such a thing?
Why can't the nv40 do this? Shared logic, or internal bus problems or what?!?!
Does that mean one ALU array (all 16 ALUs) can only work on either vertices or pixels at one moment?
Yes. All three of the arrays work independently from each other.
Can this array(bank) work on both vertices and pixels at the same time? like 4 of them work on vertices others work on pixels?
No. The array is single instruction, multiple data.
Jawed
why would it need virtual mem if it has access to the entire 512MB? :P
To not overload the BW, Right?
aaronspink
15-Jun-2005, 11:24
I would think that all the modern CPUs support IEEE correctly. I seem to recall that some (possibly even some versions of x86) may have done denorms with traps, but that's a small issue.
An Arm supports IEEE correctly and it usually doesn't have a FPU. :)
The reason I was asking was for support of GPGPU style coding, eg porting CPU-based code onto the graphics system for faster processing.
In normal code most people have it set to trap on any condition that generates a NaN. NaNs are pretty much unused.
Aaron Spink
speaking for myself inc.
mboeller
15-Jun-2005, 12:35
I have a rather naive question about the C1:
The setup-engine seems (IMHO according to the leak) be limited to around 500Mio Vertex/sec.
but the 48 unified shader should be able to T&L up to 24x10^9 (!) vertex / sec if I understand this correct.
So this would limit the vertex-shading to ~2% of the shading capabilities. If true this would mean that the C1 has a lot of capacity for lighting and pixel-shading and also that the z-only pass without lighting(?) would be extremely fast.
Is this correct?
dukmahsik
15-Jun-2005, 16:10
so is this the uber gpu everyone had expected it to be?
Simon F
15-Jun-2005, 16:39
In normal code most people have it set to trap on any condition that generates a NaN. NaNs are pretty much unused.
Yes, but if a NaN is created, at least you know something has gone AWOL. If there are no NaNs in the implementation, i.e. so that values are silently changed into something else, then you may find that the "ported" code behaves rather oddly.
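A tiny illustration of the porting hazard, using Python floats as a stand-in for the CPU side. The flush_nan function is a purely hypothetical model of hardware that silently substitutes a value where IEEE 754 would produce a NaN; nothing in the article says what Xenos actually does in this case.

```python
import math

inf = float("inf")

# IEEE behaviour: an invalid operation yields NaN, and the NaN propagates,
# so the bad result is still visible at the end of the computation.
ieee_total = (inf - inf) + 1.0
print(math.isnan(ieee_total))

def flush_nan(x, replacement=0.0):
    """Hypothetical NaN-free hardware: quietly swap NaN for some other value."""
    return replacement if math.isnan(x) else x

# Same computation with the substitution: a plausible-looking but wrong number.
silent_total = flush_nan(inf - inf) + 1.0
print(silent_total)
```

The second result looks perfectly ordinary, which is exactly the "behaves rather oddly" problem: nothing flags that something went wrong upstream.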
John Reynolds
15-Jun-2005, 16:45
My head hurts.
You're sui generis, Wavey. I can't imagine anyone else on the scene who would have/could have written this. (Some that "could", perhaps, but they wouldn't).
I'm waiting for the day when we hear the news that Dave, while laboring in the wee-hours of the night, pops an aneurysm and is found by Neva the next morning slumped over his keyboard.
I'm waiting for the day when we hear the news that Dave, while laboring in the wee-hours of the night, pops an aneurysm and is found by Neva the next morning slumped over his keyboard.
Yeah, I found myself wondering when he has time to have a job and life. Just think, he almost certainly must have turned nearly immediately from this to 7800. Oy. Reminds me of an extraordinary period I had last year into early this year. Someone asked me when I sleep, and I responded "Saturdays".
Edit: Come to think of it, then probably very shortly after that will be a real CrossFire review, with all the different combinations that ATI provided, and then R520! I hope he's promised Neva some "us" time somewhere pleasant in August!
I have a rather naive question about the C1:
The setup engine seems (IMHO, according to the leak) to be limited to around 500 million vertices/sec,
but the 48 unified shaders should be able to T&L up to 24x10^9 (!) vertices/sec, if I understand this correctly.
So this would limit the vertex shading to ~2% of the shading capabilities. If true, this would mean that the C1 has a lot of capacity for lighting and pixel shading, and also that the z-only pass without lighting(?) would be extremely fast.
Is this correct?
If you pass in screen space verts sure.
Realistically you need a minimum of 4 instructions to do anything interesting so 6billion/second is more comparable to existing numbers.
But in principle yes, the setup limit is much closer to a real-world achievable number than it is in most of the current PC/console GPUs, where setup is never a bottleneck, i.e. you could hit the limit and still do a fair amount of work on a fair number of pixels.
Of course unless a developer wants to concentrate purely on polygons it's likely that both vertex and pixel shaders will be a lot more complex and the setup limit won't actually be the bottleneck.
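The arithmetic behind these figures can be sketched as follows. This assumes each of the 48 ALUs retires one shader instruction per clock at the 500 MHz clock quoted elsewhere in this thread, and takes the 500 million vertices/sec setup limit from the question; both are assumptions of the discussion, not confirmed specifications.

```python
CLOCK_HZ = 500e6      # assumed core clock
ALUS = 48             # unified shader ALUs
SETUP_LIMIT = 500e6   # vertices/sec, per the leak mentioned in the question

def vertex_rate(instructions_per_vertex):
    # Vertices/sec if every ALU did nothing but vertex work.
    return ALUS * CLOCK_HZ / instructions_per_vertex

peak = vertex_rate(1)        # one instruction per vertex
realistic = vertex_rate(4)   # the "minimum of 4 instructions" case
setup_share = SETUP_LIMIT / peak
slots_per_vertex = ALUS * CLOCK_HZ / SETUP_LIMIT

print(peak)              # 24000000000.0
print(realistic)         # 6000000000.0
print(setup_share)       # about 0.0208, i.e. the ~2% figure
print(slots_per_vertex)  # 48.0 instruction slots per vertex at the setup limit
```

Read the other way round, running at the setup limit would still leave 48 instruction slots per vertex to share between vertex and pixel work under these assumptions.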
Of course unless a developer wants to concentrate purely on polygons it's likely that both vertex and pixel shaders will be a lot more complex and the setup limit won't actually be the bottleneck.
This is what I understand the Z-only pass is used for. Particularly if Xenos's EDRAM buffer is running in tiled mode (e.g. 3 tiles for 720p with 4xAA), the on-chip hierarchical Z-buffer is populated first by a Z-only pass which means no pixel shading.
I have to admit, right now I'm not sure what the effective "zixel" fill-rate would be in this case and whether it would be limited by the vertex rate.
Jawed
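The "3 tiles for 720p with 4xAA" figure can be checked with a small sketch. It assumes 10 MB of eDRAM (from the published console specifications, not from this thread) and 4 bytes of colour plus 4 bytes of Z per sample; the function name is purely illustrative.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # assumption: 10 MB of eDRAM

def tiles_needed(width, height, samples, color_bytes=4, z_bytes=4):
    # Total multisampled colour + Z storage, split into eDRAM-sized tiles.
    buffer_bytes = width * height * samples * (color_bytes + z_bytes)
    return math.ceil(buffer_bytes / EDRAM_BYTES)

print(tiles_needed(1280, 720, 1))  # 1
print(tiles_needed(1280, 720, 2))  # 2
print(tiles_needed(1280, 720, 4))  # 3
```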
Sorry for the question, but does this mean that Xenos will be limited to 500Mpolys/s :?:
But it can do (not setup limited) 6Bpolys/s :?:
AlStrong
15-Jun-2005, 18:50
A CPU has access to all of its memory yet has virtual memory.....
Ack... thought he meant Virtualization for WGF. :oops:
Sorry for the question, but does this mean that Xenos will be limited to 500Mpolys/s :?:
But it can do (not setup limited) 6Bpolys/s :?:
It's a meaningless statement.
It could do 4 shader ops on 6 billion pieces of data yes, but if you can't do anything with that data it's meaningless.
The Big difference is that Triangle setup on most existing PC hardware is so fast that it's never the limiting factor. It's a great stat that doesn't really reflect triangle throughput in any real way.
My guess is that in most real-world usage this will be true of Xenos as well (but I'm expecting a lot of work per pixel and vertex). However, if a dev were to concentrate on visuals with lots of polygons and we were looking at, say, Xbox-level pixel shaders, then there is a pretty good chance you could actually hit that setup limit in a real app.
ralexand
15-Jun-2005, 22:08
Can someone help my understanding here
I know that most of this is probably wrong but can someone clarify this for me:
Shader program,
texld r0, t0 ; color map
texld r1, t0 ; normal map
dp3 r2, r1_bx2, v0_bx2 ; dot(normal, light)
mul r0, r0, r2
... gets sent to Arbiter/Sequencer from UMA
||
||
\/
Arbiter/Sequencer- decides which ALU array is free and sends 1 instruction line to ALU unit?
||
||
\/
ALU- takes the instruction line and executes and waits for the arbiter to feed it another instruction either the next line of the shader or another line from a different shader that's not in a wait state.
||
||
\/
EDRAM- After all the vertex and pixel instructions are rendered for the frame, it's sent to the eDRAM for Z ops, AA etc.
Things I'm not sure about.
How many threads does one ALU "pipe" have?
Are instructions passed to ALU units one at a time or entire shader programs passed to an ALU "pipe"?
What's the limiting step in this graphics pipeline?
Where do threads come in to the equation?
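For readers unfamiliar with the shader listing in this post, here is a rough Python sketch of what the four instructions compute for a single pixel. It assumes `_bx2` is the usual "bias and scale by 2" source modifier mapping [0, 1] to [-1, 1], takes the two texture fetches as already-sampled inputs, and applies no clamping; it says nothing about how Xenos schedules the work.

```python
def bx2(v):
    # _bx2 source modifier: (x - 0.5) * 2 per component.
    return tuple((c - 0.5) * 2.0 for c in v)

def dp3(a, b):
    # Three-component dot product.
    return sum(x * y for x, y in zip(a[:3], b[:3]))

def shade(color_texel, normal_texel, light_v0):
    r0 = color_texel                  # texld r0, t0  (colour map)
    r1 = normal_texel                 # texld r1, t0  (normal map)
    d = dp3(bx2(r1), bx2(light_v0))   # dp3 r2, r1_bx2, v0_bx2
    r2 = (d, d, d, d)                 # dp3 result replicated to all channels
    return tuple(c * s for c, s in zip(r0, r2))  # mul r0, r0, r2

# A normal pointing straight at the light leaves the colour unchanged.
print(shade((0.2, 0.4, 0.6, 1.0), (0.5, 0.5, 1.0), (0.5, 0.5, 1.0)))
# (0.2, 0.4, 0.6, 1.0)
```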
darkblu - thanks for the info. I don't expect the compiler to auto-vectorize - I was hoping it could just use one double out of two and packed double SIMD instructions to emulate a sane FPU.
DaveB - I have some more questions :). Under the Xenos spec-sheet it seems that a program has access to even more registers than in PS3.0.
If the running programs used all (64) available registers, keeping 64 such threads in flight would require (assuming 16 pixels/verts per thread)
64 (regs per pix) * 16 (pix per thread) * 64 (threads) * 16 (bytes per reg)
= 1MB of multi-ported register file memory.
So, I'm guessing that the number of in-flight threads is variable, and is determined by the register space needs of the programs being run. Is this the case?
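The register file estimate above can be reproduced with a short sketch. The 256 KiB budget in the second half is an invented number used only to show how the thread count would vary with register use; it is not a Xenos specification.

```python
def register_file_bytes(regs_per_item, items_per_thread, threads, bytes_per_reg=16):
    # regs per pixel/vertex * items per thread * threads in flight * bytes per register
    return regs_per_item * items_per_thread * threads * bytes_per_reg

def max_threads(budget_bytes, regs_per_item, items_per_thread=16, bytes_per_reg=16):
    # Threads that fit in a fixed register budget (hypothetical budget).
    return budget_bytes // (regs_per_item * items_per_thread * bytes_per_reg)

worst_case = register_file_bytes(64, 16, 64)
print(worst_case)  # 1048576 bytes, i.e. 1 MB

HYPOTHETICAL_BUDGET = 256 * 1024  # invented for illustration only
print(max_threads(HYPOTHETICAL_BUDGET, 64))  # 16 threads with 64 registers each
print(max_threads(HYPOTHETICAL_BUDGET, 16))  # 64 threads with 16 registers each
```

Under a fixed budget, programs that use fewer registers allow more threads in flight, which is the behaviour the post is guessing at.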
Also - can more than two programs (1 vertex, 1 pixel) be run at the same time? If not, can new programs be loaded into the on chip instruction memory while a previous one is running?
My next question is about allocating physical register to threads (assuming their number is variable). I can think of 2 ways to do this that don't suck :
1. allocate registers individually. requires a 6 to N>=6 bit mapping per thread (the results could however be used by all 16 ALUs in a SIMD engine). Maybe this approach could even be extended by including flag bits in the instructions which would cause the SIMD engines to deallocate instruction source registers early (before program exit).
2. allocate registers in groups of 2,4,... Basically 1.) but saves mapping space at the cost of some register wastage.
Are any of these anywhere close to reality?
blakjedi
16-Jun-2005, 23:13
I was wondering, because obviously I don't know the answer... Does Dave get paid for this work or is B3D a hobbyist site? He and ACERT write Gartner Group level treatises on technology and everyone gets it for free... not to mention the thoughts of guys like DeanoC, SimonF, ERP, faf and nAo to name a few... to get that level of analysis anywhere else, including the likes of IGN, you gotta pay and they aren't half as good...
Rockster
16-Jun-2005, 23:28
Does anyone know if the Xenos texture processor supports filtering of floating point samples? It's assumed yes, but I haven't seen confirmation.
Also, how does the display chip connect to Xenos. The original leaked block diagram showed a single PCI Express lane to it. Is that accurate or does it use some other interface?
Okay, one last thing. Is it correct to say that in terms of eDram usage / Tile count, that 720p w/ 4xAA @ 32bit color is equivalent to 720p w/ 2xAA @ FP16 color, but the latter only runs at half the fillrate?
fabulous article!
amazing work Dave!
congrats!
[Brick_top]
17-Jun-2005, 11:08
Although I didn't understand 90% of the article :oops: It looks revolutionary :) I mean the technology...
Mulciber
17-Jun-2005, 12:39
Although I didn't understand 90% of the article :oops: It looks revolutionary :) I mean the technology...
revolutionary articles!
This sounds like a "super" Northbridge and more for a traditional Intel like system. Is this the way forward for a PC or is an integrated On cpu memory controller the future?
Side point off where the thread has gone since this but:
It's more than a super northbridge.
The way I see it the GPU has taken over the central place in the thing.
The 'CPU' is now kinda a General Purpose coprocessor to the Central Graphics Processor.
I mean, the GPU can reach into the CPU & nick data directly out of L2 cache :shock:
The ram is GDDR.
The main memory controller is the GPU one.
Everything hangs off the GPU in this architecture.
arrrse,
The funny thing is that's the future. Even Intel admits it, though subconsciously with their P4 is THE multimedia chip. But graphics are the biggest thing, especially with Apple really helping to make good graphics ubiquitous in all we do.
I'm dying to see this get to PCs.
Or at least to see games of types that I'd like to play & with controller types I like to use & the ability to run a general OS on the xbox2 :)
mainman
18-Jun-2005, 04:55
Impressive article to say the least. Very good read.
dukmahsik
18-Jun-2005, 05:32
does anyone know how many gigapixels/sec xenos can do?
arjan de lumens
18-Jun-2005, 06:57
does anyone know how many gigapixels/sec xenos can do?
Page 4 of the article: 8 pixels per clock. Clock speed is stated to be 500 MHz. That would indicate 4 gigapixels/sec, or about half the paper performance of an X850XT. But the Xenos can do at least 4X AA with no fillrate or bandwidth penalty, which the X850XT cannot do, and there is good reason to believe that the Xenos will come much closer to theoretical specs than the X850XT in any case.
dukmahsik
18-Jun-2005, 08:55
thanks, inane broke it down for me like this for 360 vs ps3
Console | AA level | Pixel fillrate | Sample fillrate
-------------------------------------------------
X360 | 0x MSAA | 4.0 billion/s | 4.0 billion/s
X360 | 2x MSAA | 4.0 billion/s | 8.0 billion/s
X360 | 4x MSAA | 4.0 billion/s | 16.0 billion/s
PS3 | 0x MSAA | 8.8 billion/s | 8.8 billion/s
PS3 | 2x MSAA | 4.4 billion/s | 8.8 billion/s
PS3 | 4x MSAA | 2.2 billion/s | 8.8 billion/s
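The table above follows from two simple models, sketched below. The X360 rows assume 8 pixels per clock at 500 MHz with the sample rate scaling with the AA level; the PS3 rows reproduce the poster's assumption of a fixed 8.8 billion samples/s shared between the samples of each pixel. Neither model is a confirmed specification.

```python
def x360_fill(aa, pixels_per_clock=8, clock_hz=500e6):
    # Pixel rate is constant; sample rate scales with the AA level.
    samples = max(aa, 1)  # "0x" in the table means one sample per pixel
    pixel_rate = pixels_per_clock * clock_hz
    return pixel_rate, pixel_rate * samples

def ps3_fill_as_posted(aa, sample_rate=8.8e9):
    # Poster's model: sample rate is fixed, so pixel rate drops with AA.
    samples = max(aa, 1)
    return sample_rate / samples, sample_rate

for aa in (0, 2, 4):
    px, sm = x360_fill(aa)
    print("X360", aa, px / 1e9, sm / 1e9)
for aa in (0, 2, 4):
    px, sm = ps3_fill_as_posted(aa)
    print("PS3 ", aa, round(px / 1e9, 2), sm / 1e9)
```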
You should double 'samples fillrate' as RSX ROPs should handle 2 samples per cycle.
It's probably like this:
PS3 | 0x MSAA | 8.8 billion/s | 8.8 billion/s
PS3 | 2x MSAA | 8.8 billion/s | 17.6 billion/s
PS3 | 4x MSAA | 4.4 billion/s | 17.6 billion/s
Or maybe like that:
PS3 | 0x MSAA | 5.6 billion/s | 8.8 billion/s
PS3 | 2x MSAA | 5.6 billion/s | 11.2 billion/s
PS3 | 4x MSAA | 2.8 billion/s | 11.2 billion/s
Titanio
18-Jun-2005, 12:31
I've been thinking about this for a little while now, and have a few questions. Actually, this is half me thinking out loud, half asking questions, but either way, feedback and/or answers would be appreciated!
As I understand it, on a basic level unified shading allows you to use far more arbitrary mixes of vertex and pixel instructions without a drop of utilisation compared to traditional "fixed" architectures. Basically the hardware moulds itself around your instruction mix. Here's where my questions come in:
1) A common example I see used to illustrate the architecture's flexibility is if in one frame, the viewer is looking at a low poly scene, and then switches the view to a high poly scene for the next frame, USA will happily eat either with high utilisation. In this case, with a fixed architecture, in the first instance your vertex shaders may be idling. What exactly does this higher utilisation buy us, however? Someone can correct me if I'm wrong here, but I don't think we can really control the mix of instructions being sent to the card from frame to frame - using our example from above, I don't think we can, with a low poly view of the scene, dynamically increase the complexity of the pixel shaders to make our fewer polys look better, or with a high poly scene, reduce the pixel complexity on our objects to increase the vertex complexity (e.g. to handle more on screen for example). Asides from the jarring steps up and down in image quality, which wouldn't exactly be pleasing, if such a frame-by-frame analysis of the scene, and subsequent adjustment of shader complexity was possible, then utilisation on a fixed architecture wouldn't be a problem either - you'd just constantly adjust your shader complexities to always keep your vertex and pixel shaders in top flight. So assuming dynamic changes to shader complexity isn't possible, what does the higher utilisation get us? Presumably you could compute a frame with a non-ideal mix of instructions (as far as a fixed architecture is concerned) faster than a fixed architecture could, so your framerate might be higher when looking at that low poly view or whatever, but a game always has to be smooth anyway, so it seems to me you'd just be going from smooth to smoother.
2) Assuming I'm correct in 1), then I see the real benefit of a USA in scene creation: the ability to create scenes without worrying about the relative weighting of vertex shading vs pixel shading. Though we can't control the instruction mix frame-by-frame, we can control it at some level when creating scenes. So with a USA, you might have one room with lots of objects but lower quality pixel shading, and in another, few objects but higher quality pixel shading. You can trade off pixels against vertices and vice versa as you wish. But with fixed hardware, can't we just then create scenes that map well to our fixed proportions of vertex vs pixel processing? Perhaps we can't in the PC space, since those proportions aren't the same across all cards (?) - and this no doubt would be dragging down those utilisation figures, which ATi have thrown out - but in a closed system with a fixed architecture we could. Obviously you won't get full utilisation still - frame-by-frame the view changes, as with the issues mentioned in 1) and you can't adjust for that - but you could adjust things so that for as much time as possible, for the typical views of the scene, your mix will map well to the architecture. On the USA, if you can't dynamically take advantage of low poly situations frame-by-frame etc. then it too will simply be working on a "scene level", with the scene it's given...but some views may run at a faster framerate than others, and than they would on the fixed architecture (but again, if it's a case of smooth to smoother..obviously however, with extremes it may not be possible to do the same on the fixed architecture, to run as smoothly, but that brings me to my point below..).
3) So I'm thinking then that the true beauty of USA is in that flexibility on a scene-creation level. But from a technical perspective, when it comes to comparison and examining technical merit, is the view with say 10% vertex shading, 90% pixel shading any more technically competent than the view with a (fixed) 30% vertex and 70% pixel mix? You are trading off one against the other after all; your total computation, if you wish, is still fixed, just the proportions are different.
4) Somewhat related to 3, but how would a USA compare if you were running through it a scene with a mix that mapped well to another chip's fixed architecture? I'm guessing the fixed architecture should do better with a mix that matches its hardware than a USA could do with the same mix, given that internally its dedicated shaders should be more efficient. That issue could also eat into the gains made by the USA on other mixes too.
5) Related to 4, but if a game has been designed first on a fixed architecture and optimised for it, and then ported to a USA, some of the benefits of the USA may be lost, no?
In all of the above, I'm really considering the case with closed architectures. With PCs, some if not a lot of the above points don't carry over as well.
Sorry if none of that made any sense, it's early (well, for a Saturday) and I'm not feeling my most articulate. I'd appreciate responses, or correction where I'm wrong. Thanks!
AlStrong
18-Jun-2005, 18:20
Page 4 of the article: 8 pixels per clock. Clock speed is stated to be 500 MHz. That would indicate 4 gigapixels/sec, or about half the paper performance of an X850XT.
What is the maximum theoretical pixel fillrate of the X850XT at 4xMSAA? Just divide by 4?
Rockster
18-Jun-2005, 18:31
Take the total bandwidth and divide it by 8 bytes per pixel (color + z). So max theoretical is 37.8GB/sec / 8 or approximately 4.7Gpixels/sec.
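As a quick check of this estimate, the bandwidth-limited fill rate is just bandwidth divided by bytes written per pixel. The sketch below uses the 37.8GB/sec and 8 bytes per pixel quoted in the post; the function name is illustrative, and the real figure depends on compression and on whether Z is written at all.

```python
def bandwidth_limited_fill(bandwidth_gb_s, bytes_per_pixel):
    # Upper bound on Gpixels/sec when every pixel costs bytes_per_pixel of traffic.
    return bandwidth_gb_s / bytes_per_pixel

print(bandwidth_limited_fill(37.8, 8))  # about 4.7 Gpixels/sec
```

A smaller effective byte count per pixel raises this ceiling proportionally.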
Reverend
19-Jun-2005, 06:07
Rather simplistic question : Can we know if some XB360 games will have a "Anti-Aliasing" checkbox in one of the menus (and if there'll be "2x" and "4x" options at that) or will some fillrate-friendly games already have AA "hardcoded-implemented"?
Also, if I develop a game now that I want to be made available for both the XB360 and the PS3, given the differences between the two, how hard would it be and if there really needs to be two separate developer teams? I'll probably shoot this question off to a few developers (throwing in the obvious technical differences between the two consoles) some time later but comments by you guys are welcomed.
Finally, anyone knows how much MS paid ATI?
Rather simplistic question : Can we know if some XB360 games will have a "Anti-Aliasing" checkbox in one of the menus (and if there'll be "2x" and "4x" options at that) or will some fillrate-friendly games already have AA "hardcoded-implemented"?
4X MSAA is virtually free on Xenos so it wouldn't make much sense to deactivate it for the main render target.
Maybe MS would not even let developers deactivate it as a technical requirement needed to publish a game.
Also, if I develop a game now that I want to be made available for both the XB360 and the PS3, given the differences between the two, how hard would it be and if there really needs to be two separate developer teams? I'll probably shoot this question off to a few developers (throwing in the obvious technical differences between the two consoles) some time later but comments by you guys are welcomed.
A game that doesn't really push the envelope wouldn't be too hard to develop for a single team, IMHO.
I would think ms would require 2x fsaa to be released
pjbliverpool
19-Jun-2005, 12:05
So are we saying that X360 ROPs can handle 4 samples per pixel, and that's why 4x FSAA is free in terms of fill rate? And of course it's free in terms of memory bandwidth because of the eDRAM?
So why can't RSX support 4 samples per pixel without a fill rate hit as well? Surely this is something that would be very important on a PC-like card which focuses on things like FSAA?
Or is it because it has more ROPs and hence doesn't need them to handle as many samples?
Rockster
19-Jun-2005, 13:09
So why can't RSX support 4 samples per pixel without a fill rate hit as well?
It could, but would be useless since the memory bandwidth could not support it. ROP count is largely irrelevant as even though RSX may have twice as many, the eDram will likely ensure that Xenos can sustain the greater fill rate. The question would be if the PS3 wanted to target 4xAA, could they save die space by using 8 ROP's with single cycle 4xAA vs. 16 ROP's with 2xAA.
It takes multiple cycles for an NV40's ROP to handle 4x MSAA (2 cycles AFAIK), whilst Xenos ROPs can handle 4x MSAA in one cycle.
Take the total bandwidth and divide it by 8 bytes per pixel (color + z). So max theoretical is 37.8GB/sec / 8 or approximately 4.7Gpixels/sec.
Z compression is better than that. The lowest you get is 5 bytes per pixel for X800 AFAIK. And you don't always need to write Z.
Back-Buffer = Pixels * FSAA Depth * (Pixel Colour Depth + Z Buffer Depth)
Front-Buffer = Pixels * (Pixel Colour Depth + Z Buffer Depth)
Total = Back-Buffer + Front-Buffer
i tried using this equation for the framebuffer usage but i cant seem to get the same answers which are in the article, eg
640x480 = 307200 pixels
307200 pixels*(32+32) = 19660800
19660800 / 8 = 2457600 - bits to bytes
2457600 / 1000000 = 2.4576MB - bytes to megabytes
edit - lol, doesn't matter, gotta divide by 1024 not 1000 - heat is getting to me over here.
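The calculation in this post can be written out as a small sketch. It follows the equation exactly as posted (including counting Z in the front buffer) and converts with 1024 rather than 1000, as the edit notes; the function name and default bit depths are illustrative assumptions.

```python
def framebuffer_bytes(width, height, fsaa=1, color_bits=32, z_bits=32):
    # Back-Buffer  = Pixels * FSAA Depth * (Colour Depth + Z Depth)
    # Front-Buffer = Pixels * (Colour Depth + Z Depth), as written in the post
    pixels = width * height
    back_bits = pixels * fsaa * (color_bits + z_bits)
    front_bits = pixels * (color_bits + z_bits)
    return back_bits // 8, front_bits // 8

back, front = framebuffer_bytes(640, 480)
print(back)                  # 2457600 bytes
print(back / 1_000_000)      # 2.4576 (dividing by 1000s)
print(back / (1024 * 1024))  # 2.34375 (dividing by 1024s)
```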
nice really nice, what do you have to say about nintendo wanting to keep costs at an absolute minimum much like the GC compared to the other consoles? would they include a costly feature such as 12mb edram? if they do, I am all for it as I like nintendo and their games. but I don't see it happening still for reasons I have listed while your opinion is otherwise.
Is it cheaper to slap on some eDRAM (assuming it's both higher yield and easier to move to smaller processes) than to spend the money on architecting and fabricating a more complex GPU (Xbox 360, PS3)? I'm thinking in terms of the PS2's GS and its 4MB EDRAM, which was likened to a Voodoo 2 on steroids, and still produces decent visuals (considering its memory limitations vs. Xbox).
Rockster
19-Jun-2005, 21:16
Z compression is better than that. The lowest you get is 5 bytes per pixel for X800 AFAIK. And you don't always need to write Z.
I haven't seen a benchmark that bears out better than 6 bytes per pixel, but the point I was trying to make is that it's a bandwidth limit not a ROP limit. People tend to get caught up with number of ROP's and forget the important part. Didn't want to see anymore posts with RSX listed at 8.8GP/sec when the max theoretical is somewhere between 3 and 4.
Z compression is better than that. The lowest you get is 5 bytes per pixel for X800 AFAIK. And you don't always need to write Z.
I haven't seen a benchmark that bears out better than 6 bytes per pixel, but the point I was trying to make is that it's a bandwidth limit not a ROP limit. People tend to get caught up with number of ROP's and forget the important part. Didn't want to see anymore posts with RSX listed at 8.8GP/sec when the max theoretical is somewhere between 3 and 4.
ATI claims up to 24:1 compression for Z (with 6xMSAA, I guess).
With RSX, we still don't know about its ROP architecture. Because of the two different interfaces, they could well have made some changes there. But peak ROP performance indeed hardly matters for anything but Z-only passes.
Megadrive1988
20-Jun-2005, 02:50
observation ~ question:
Gamecube's Flipper GPU can do single-cycle trilinear filtering, right?
648M pixels/sec - that is with trilinear on, at least some form of it.
but Xenos can only do single-cycle bilinear filtering, correct? it would take another cycle to do trilinear filtering w/ loopback. it can still do trilinear in a single pass, but not a single cycle.
even though there is a large difference in fillrate between Gamecube-Flipper and Xbox 360-Xenos, it seems Flipper was optimised with trilinear filtering and Xenos optimised for bilinear filtering.
ok now someone with real graphics knowledge show me where I am wrong.
observation ~ question:
Gamecube's Flipper GPU can do single-cycle trilinear filtering, right?
648M pixels/sec - that is with trilinear on, at least some form of it.
but Xenos can only do single-cycle bilinear filtering, correct? it would take another cycle to do trilinear filtering w/ loopback. it can still do trilinear in a single pass, but not a single cycle.
even though there is a large difference in fillrate between Gamecube-Flipper and Xbox 360-Xenos, it seems Flipper was optimised with trilinear filtering and Xenos optimised for bilinear filtering.
ok now someone with real graphics knowledge show me where I am wrong.
That seems odd, if true. :evil:
Rockster
20-Jun-2005, 19:55
Once again, this is a bandwidth issue. Texture samples in the GameCube are made from eDRAM (1MB texture buffer) which allows for the additional samples per clock required for trilinear. That particular chip, since it was designed around the start of the shader era, tackled the problem of programmability through extensive/exotic texture use as opposed to shaders.
richardpfeil
22-Jun-2005, 06:24
There seem to be some things happening here that are not being explicitly stated. Tiling means either processing the vertices multiple times (once for each tile) or binning and deferring rendering. The article states "During the Z only rendering pass the max extents within the screen space of each object is calculated and saved in order to alleviate the necessity for calculation of the geometry multiple times." The key word here is 'Saved'. Where is this information saved? The answer seems to be MEMEXPORT.
This is purely speculation on my part, but it seems to make some sense. Here's the process...
Send all geometry to the GPU.
- All 48 shaders processing vertices.
- The results of vertex shading go to two places, rasterization and MEMEXPORT.
- Rasterization generates Z values, out to ROPs.
- Raster also calculates tile hits, and attaches info to shading results in MEMEXPORT.
- MEMEXPORT queue written to main memory.
Render each tile.
- All 48 shaders processing pixels.
- Set up tile.
- MEMEXPORT data marked for this tile sent back into GPU.
- This data goes directly back into the Rasterizer.
- Raster results sent to shaders.
- Shader results to ROPs.
If I'm right it brings up some interesting questions...
- Can the ROPs work in double color, as well as double Z modes?
- How much memory would the exported vertex shader results take up? (My guess, 64 bytes * numVerts or 64MB for roughly a million polygons)
- What is the bandwidth cost?
Bottlenecks when running this way...
- 48 Vec4 + Scalar per clock, only one triangle to rasterizer per clock. (Ouch! But that's not an unheard of vertex shader length)
- 48 Vec4 + Scalar per clock, 8 ROPS per clock. (Not too bad, just need 6 Vec4/Scalar pairs in your pixel shader)
- Triangles smaller than 8 pixels will starve the ROPs. (True in any case)
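The rough numbers in this speculation can be tallied in a few lines. Everything here is the poster's guess (64 bytes per exported vertex, one triangle per clock into the rasteriser, 8 ROPs), so treat it as a sketch of the reasoning rather than measured behaviour.

```python
ALUS = 48                    # Vec4 + scalar units per clock
ROPS = 8                     # pixels written per clock
TRIANGLES_PER_CLOCK = 1      # rasteriser input rate, per the post
BYTES_PER_VERTEX = 64        # poster's guess for exported vertex data

def export_bytes(num_verts):
    # Main-memory footprint of vertex results written out via MEMEXPORT.
    return num_verts * BYTES_PER_VERTEX

alu_ops_per_triangle = ALUS // TRIANGLES_PER_CLOCK  # break-even vertex shader length
alu_ops_per_pixel = ALUS // ROPS                    # break-even pixel shader length

print(export_bytes(1_000_000))  # 64000000 bytes, roughly 64MB
print(alu_ops_per_triangle)     # 48
print(alu_ops_per_pixel)        # 6
```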
Rockster
22-Jun-2005, 12:44
I think you're misunderstanding the process. The driver simply tags each vertex fetch command with the tiles that command affects. All those commands are stored by the driver since it is handling the Z-only and rendering passes. So it sends all geometry to update hier-Z and gets updated as to which tiles each command affects. Then when rendering, for example, tile 1, the driver only submits the commands which affect that tile. Some objects will cross tile boundaries, and those will require their geometry to be processed multiple times. The z-only rate with 4xAA is 64 z samples per clock or 32Gzixels/sec.
The identity of the tile(s) containing the triangle still need to be stored somewhere. If you have 3 tiles, you need to know which of the three tiles a triangle intersects - e.g. a tile coverage mask, with batches of triangles' masks compressed in some meaningful way.
By making the z-only pre-pass perform transform, lighting and shading of vertices, you are left with a reduced set of vertices in screen space, rather than world or object space. They're all fully lit and should only need rasterising.
Jawed
The identity of the tile(s) containing the triangle still need to be stored somewhere. If you have 3 tiles, you need to know which of the three tiles a triangle intersects - e.g. a tile coverage mask, with batches of triangles' masks compressed in some meaningful way.
Xenos doesn't tag triangles, it tags primitive batches.
It needs to reserve just one or two more bytes in the command buffer to save tags, since you have only a few tiles.
By making the z-only pre-pass perform transform, lighting and shading of vertices, you are left with a reduced set of vertices in screen space, rather than world or object space. They're all fully lit and should only need rasterising.
It depends, developers can do lighting in the first pass or in any subsequent pass
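The batch tagging idea discussed above can be sketched as a per-batch bitmask with one bit per tile. The horizontal-band tile layout, the screen-space extents and all names below are hypothetical; the thread does not say how Xenos or its driver actually lays out tiles or stores the tags.

```python
def make_tiles(width, height, count):
    # Assumption: tiles are horizontal bands of equal height.
    band = -(-height // count)  # ceiling division
    return [(0, i * band, width, min((i + 1) * band, height)) for i in range(count)]

def tile_mask(extent, tiles):
    # extent = (x0, y0, x1, y1) screen-space bounds of one primitive batch.
    x0, y0, x1, y1 = extent
    mask = 0
    for i, (tx0, ty0, tx1, ty1) in enumerate(tiles):
        if x0 < tx1 and x1 > tx0 and y0 < ty1 and y1 > ty0:
            mask |= 1 << i  # batch touches tile i
    return mask

def batches_for_tile(tagged_batches, tile_index):
    # Only resubmit the batches whose tag includes this tile.
    return [name for name, mask in tagged_batches if mask & (1 << tile_index)]

tiles = make_tiles(1280, 720, 3)
extents = {"sky": (0, 0, 1280, 200),
           "character": (500, 200, 700, 500),
           "floor": (0, 480, 1280, 720)}
tagged = [(name, tile_mask(ext, tiles)) for name, ext in extents.items()]

print(tagged)                       # [('sky', 1), ('character', 7), ('floor', 4)]
print(batches_for_tile(tagged, 1))  # ['character']
```

With three tiles the tag fits in three bits per batch, which is in line with the "one or two more bytes" remark, and only batches that straddle tile boundaries are processed more than once.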
Simon F
22-Jun-2005, 14:30
By making the z-only pre-pass perform transform, lighting and shading of vertices, you are left with a reduced set of vertices in screen space, rather than world or object space. They're all fully lit and should only need rasterising.
Jawed
That approach could be very expensive if the application was doing a lot of instancing or dynamically manufacturing extra sets of texture coordinates.
Furthermore, it might be pointless saving geometry that ended up being obscured by later geometry.
richardpfeil
23-Jun-2005, 07:28
Simon: You are absolutely right that instancing would be expensive. Procedurally created geometry and terrain height-field mapping would also be a problem. It's definitely pointless to store obscured geometry, but a front to back sort should be able to cull at least 60% of the polys that are occluded in the first pass (I bet 90% is obtainable). Modern GPUs already do a good job at this.
Rockster: Until vertex shading is completed the position of the vertex in screen space is unknown. I don't see how the driver (a bit of a misnomer for a console) can do the tagging. That used to work on PCs, before hardware T&L. All vertex setup happened on the CPU, and the driver had access to the vertex's position in screen space. Tiling solutions died once the driver gave up control of the vertices. Question is, how did ATI resurrect it?
@wavey
The hierarchical Z featured in Xenos: are there any improvements over previous implementations? Does it still not work when certain Z/stencil operations are done, or has this been improved?
IgnorancePersonified
01-Jul-2005, 05:40
This sounds like a "super" Northbridge and more for a traditional Intel like system. Is this the way forward for a PC or is an integrated On cpu memory controller the future?
Side point off where the thread has gone since this but:
It's more than a super northbridge.
The way I see it the GPU has taken over the central place in the thing.
The 'CPU' is now kinda a General Purpose coprocessor to the Central Graphics Processor.
I mean, the GPU can reach into the CPU & nick data directly out of L2 cache :shock:
The ram is GDDR.
The main memory controller is the GPU one.
Everything hangs off the GPU in this architecture.
Yeh that's sort of what I thought arrrse. Big changes afoot!
Maybe ATI could integrate a CPU into their northbridge here and be done with it :lol:
I posted this question in the console forum but, as many of you know, it's now locked. I hope that someone at least finds the time to answer, as I've read Dave's article on Xenos many times but there are still things that I don't quite understand, being a newbie and all.
Since Xenos has 48 ALU pipes grouped 16 x 16 x 16 & they can do both pixel & vertex shading, I was wondering if ATI designed Xenos in such a way that a programmer would have the option to do the following:
A: Use each group of 16 to do a frame each.
B: Use two of the 16 to do a frame & let the third do any additional effects.
C: Let all 48 do a frame at a time.
Secondly, if Xenos can in fact do either A or B, would you be able to make an educated guess as to how efficient the Unified Shader Arch really is?
& lastly, just how did ATI decide that 48 ALU pipes would be fine for this particular application?
Mariner
25-Jul-2005, 12:32
IIRC, for each clock the 48 ALUs can do either Pixel or Vertex processing. They can't be split to do both Pixel and Vertex at the same time.
At least, I'm pretty sure that's the information we have been provided with.
Dave Baumann
25-Jul-2005, 12:44
No, they are 3 MIMD engines, so at any one point in time they will be processing three entirely separate threads, each of which can be any program type. Each engine contains 16 SIMD processors which will be operating on the same data from a single thread.
Thanks Dave. :D So since what I asked is possible, which of my suggested options would you prefer?
Dave,
In your article you made mention that final clocks were undecided. Seeing that the r520 in the 1800xl clocks at 500MHz and the latest revision had an instant 160MHz gain, it seems that the C1 may have a lot of headroom. Has there been any confirmation on the final clocks? And since I dug up this old thread let me ask;
How to scale C1?
What is the relationship, in number of transistors, between the portions of the chip devoted to scheduling and control logic to the ALU arrays and texture units and how do they scale? Does this give it an advantage compared to non USA designs whereby the transistor count will not increase as rapidly?
pakpassion
10-Oct-2005, 04:33
hey dave:
http://techon.nikkeibp.co.jp/article/NEWS/20051005/109392/20051005protecfig1.jpg
NEC posted the above picture .. is this right? They are saying that the eDRAM to GPU connection is not 32 GB/s like in the article but 22.4 GB/s.
here is the picture article:
NEC Electronics exhibited the SiP (System in Package) developed for Microsoft Xbox 360 next-generation home game console at PROTEC JAPAN 2005 at Makuhari Messe from 2005/10/5.
It contains a DRAM-embedded LSI made by NEC Electronics and a graphics LSI made by TSMC in Taiwan in one package. Those bare LSI chips are laid horizontally in a package. Specifically, the graphics LSI and the DRAM-embedded LSI are connected to interposers in a package with flip chips. The reason why NEC Electronics adopted flip chip to connect them is to improve the data transfer speed between chips. In Xbox 360, the maximum data transfer speed required between the DRAM-embedded LSI and the graphics LSI is rather high at 22.4GB/sec. According to NEC Electronics, to achieve this transfer speed, they passed up wire-bonding which is popular for in-package wiring as wire-bonding results in too big wiring inductance. The work for making them in SiP is done by Microsoft.
DarkRage
28-Nov-2005, 18:13
Ok, probably stupid question, and probably nobody is going to read it as this thread is almost dead, but let's try it before creating a whole thread:
As Xenos is working with groups of 16 ALUs working on the same instruction in the same shader... what happens when we have a small triangle?
If we have a triangle with just 3 pixels... 13 ALUs are stalling?
If we have many small triangles, can the ALUs work on all of them? For example, if we have 20 triangles with the same shader, each one with 3 pixels (60 pixels in total), could Xenos assign 16 ALUs (one bank) to the first 16 pixels even if they are not next to each other? Or would we have 3 ALUs working and 13 ALUs waiting?
So, basically, are those banks of 16 ALUs working as a typical quad? That would be a massive waste of resources IMO.
Sorry, I got such a mess with it.
On the face of it, the wastage from single small triangles appears to be what's happening.
It's worse because a thread (batch) is actually 4 phases of 16 = 64 pixels in size.
This kind of thing appears to be a common limitation with all GPUs - other GPUs work with even larger groups of pixels (256, 1000, 4000 - roughly speaking).
NVidia GPUs appear to support 20 triangles:
http://www.beyond3d.com/forum/showthread.php?p=597388#post597388
which ameliorates the problem, somewhat - although they're dealing with pixel counts in the thousands (NV40 and older, though G70 is more like 800-1000 apparently).
Jawed
Dave Baumann
28-Nov-2005, 18:27
Geometry and pixel data are batched according to the same state. If you have lots of small objects from different commands then you'd get wastage; if a command generates larger batches then it's minimised.
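To put rough numbers on the small-triangle question, here is a minimal sketch of the utilisation arithmetic. It assumes the 64-pixel batch size (4 phases of 16) mentioned earlier in the thread, and it ignores quad granularity and any other hardware detail; the function name `alu_utilisation` is made up for illustration and does not describe how Xenos actually schedules work.

```python
BATCH_SIZE = 64  # 4 phases x 16 ALUs, the figure quoted in this thread


def alu_utilisation(pixels_per_batch):
    """Fraction of ALU slots doing useful work across the given batches.

    Each entry is the number of pixels submitted together as one batch.
    Every batch is assumed to occupy whole multiples of BATCH_SIZE slots.
    """
    used = sum(pixels_per_batch)
    # Round each batch up to a whole number of 64-slot threads.
    slots = sum(-(-p // BATCH_SIZE) * BATCH_SIZE for p in pixels_per_batch)
    return used / slots if slots else 0.0


# 20 triangles of 3 pixels, each ending up in its own batch (different state).
separate = alu_utilisation([3] * 20)

# The same 60 pixels sharing one state and packed into a single batch.
packed = alu_utilisation([60])

print(f"separate batches: {separate:.1%}")
print(f"one packed batch: {packed:.1%}")
```

Under these assumptions the gap between the two cases is the wastage being discussed: it depends on how many pixels share a state, not on the size of any single triangle.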
I have wondered in recent weeks about < 64 fragment batches on Xenos, but I was too embarrassed to ask Wavey :lol:
I didn't understand what happened when you got batch sizes that small and whether you'd just waste execution units.
I wondered whether, if there's only a single quad of fragments to process (or fewer than 64 at least), the hardware would process them anyway or buffer for more fragments.
When batches get that tiny, you just suffer the hit in efficiency, because it'll rarely happen that way in real-world usage when using Xenos to draw games. So now I know.
There's CPU overhead and GPU register programming overhead involved with changing state so you'd probably never notice the inefficiency of the ALUs for batches that are super small.
DarkRage
29-Nov-2005, 13:03
Geometry and pixel data are batched according to the same state. If you have lots of small objects from different commands then you'd get wastage; if a command generates larger batches then it's minimised.
Thanks all for your answers.
So, compared to G70 and R520 from a theoretical point of view, can we expect better efficiency in Xenos for small triangles?
RobertR1
01-Dec-2005, 06:56
Dave going by this line in the conclusion:
"given that most of the first generation titles will not have been developed on the final hardware."
Does that indicate that the current titles are nothing more than PC ports, developed on PC hardware with the DX9 API and then optimized to run on the Xbox 360? If so, what titles are you aware of that will be developed directly on the final hardware and optimized well, thus demonstrating some of the true capabilities of the 360?
I personally do not have a good idea of exactly what I am looking at when playing so I was hoping you could shed some light onto this issue. Am I just seeing a PC game on the Xbox or am I seeing a true Xbox 360 game when I'm playing PGR3 and Madden 06?
Just one question: does anyone have any more info on the Tesselator function? The only info I can see is in this pdf (http://www.ati.com/developer/eg05-xenos-doggett-final.pdf).
Does anyone know how much performance it draws? Can it be used as a primary rendering mode and for LOD, always with the ideal minimum/maximum detail based on the distance from the camera? Is it hard to implement?
Anyway, any info is good. :smile:
Does anyone have a final transistor count on the NEC daughter die? I am looking for a breakdown of transistors devoted to logic and the eDRAM.
I assume that 83.88M transistors are devoted to the eDRAM. I have read on different press releases that the total transistor count is either 90M, 100M, just over 100M, 105M or 150M.
The last update by DaveBaumann indicates a figure of 105M. However, it was made to the main page and has been overwritten; the main article was not updated:
Google Cache (http://www.google.com/search?q=cache:AOxLQCREtyIJ:www.beyond3d.com/forum/viewtopic.php%3Ft%3D23487%26view%3Dprevious+NEC+eD RAM+xbox+logic+transistor&hl=en&gl=ca&ct=clnk&cd=3)
The most popular figure right now is 90M transistors, however that means that the chip only has about 6M transistors devoted to logic.
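For anyone who wants to check the arithmetic behind these figures, here is a rough sketch. It assumes 10 MiB of eDRAM and exactly one transistor per stored bit, which is where the 83.88M figure comes from; redundancy and anything else on the die are ignored, and the names are only illustrative.

```python
EDRAM_BYTES = 10 * 1024 * 1024      # assumption: 10 MiB of eDRAM
CELL_TRANSISTORS = EDRAM_BYTES * 8  # assumption: one transistor per stored bit


def logic_transistors(total_transistors):
    """Transistors left for logic once the eDRAM cells are subtracted."""
    return total_transistors - CELL_TRANSISTORS


print(f"eDRAM cells: {CELL_TRANSISTORS / 1e6:.2f}M")

# Total counts that have been reported for the daughter die.
for total in (90_000_000, 100_000_000, 105_000_000, 150_000_000):
    left = logic_transistors(total)
    print(f"total {total / 1e6:.0f}M -> logic {left / 1e6:.2f}M")
```

On these assumptions a 90M total leaves only about 6M transistors for logic, while 105M leaves about 21M.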
Dave Baumann
01-Feb-2006, 18:04
105M was the latest from ATI's engineering.
mboeller
02-Feb-2006, 14:15
According to this interview:
http://interviews.teamxbox.com/xbox/1458/The-Power-of-the-Xbox-360-GPU/p1/
the daughter die has 90 million transistors (see page 2)
People love numbers. How many transistors does the Xenos have? Can you break that down into the parent and daughter die? Explain some of the numbers that have been mentioned, such as the 2-terabit; the 32GB/sec and 22.4GB/sec bandwidths, etc.
Bob Feldstein: 235 million transistors parent die, 90 million transistors daughter die. Bandwidth for Intelligent Memory is derived from the following events occurring every cycle:
2 Quads of samples/cycle * 4 samples * (4 bytes color + 4 bytes Z) * 2 (read and write) * 500MHz = 256 GB/sec (that is, 2 Terabits/sec).
The 22.4GB/sec link is the connection to main memory (note, incidentally, that all 512MB of Xbox 360 system memory is in one place, which makes accessing it easier from a developer perspective). The GPU is also directly connected to the L2 cache of the CPUs – this is a 24GB/sec link. Memory bandwidth is extremely important, which is why we spent so much time on it. Fortunately, designing the system from the ground up gave us the freedom to build incredible bandwidth into the box.
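The eDRAM bandwidth formula in the quote can be checked with a few lines. This is only a sketch of the arithmetic: it assumes each quad is 4 pixels (the product only reaches 256 GB/sec if the quads are expanded that way) and uses decimal gigabytes; the function and parameter names are made up for illustration.

```python
def edram_bandwidth(quads_per_cycle=2, pixels_per_quad=4, samples_per_pixel=4,
                    colour_bytes=4, z_bytes=4, read_and_write=2,
                    clock_hz=500_000_000):
    """Peak bytes per second implied by the quoted per-cycle figures."""
    bytes_per_cycle = (quads_per_cycle * pixels_per_quad * samples_per_pixel
                       * (colour_bytes + z_bytes) * read_and_write)
    return bytes_per_cycle * clock_hz


bw = edram_bandwidth()
print(f"{bw / 1e9:.0f} GB/sec")        # bytes per second in decimal gigabytes
print(f"{bw * 8 / 1e12:.3f} Tbit/sec")  # the same figure in terabits
```

The result is 512 bytes per cycle at 500MHz, which matches the 256 GB/sec (roughly 2 Terabits/sec) quoted above.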
Dave Baumann
02-Feb-2006, 15:10
I would trust the number that we have, rather than the one that Bob has mentioned. Ours comes from the Engineers.
AlStrong
02-Feb-2006, 19:53
and engineers > all :D
So your numbers are 232+100?
Brimstone
03-Feb-2006, 21:33
The frame buffer size diagram should be tweaked to differentiate between 480i and 480p imho. As it stands now it shows 480p, 720p, and 1080i but doesn't point this out.
At the moment XBOX 360 is supporting 720p (progressive scan) and 1080i (interlaced) resolutions - 720p equates to 1280x720 pixels and 1080i equates to 1920x1080 pixels, however interlacing means that only the odd horizontal lines are refreshed on one cycle and the even lines on the next, which means that the frame buffer is only ever needing to handle 1920x540 pixels per refresh.
Here are the frame-buffer sizes for these HDTV resolutions and 640x480 with a colour depth of 32-bit (which will cover both the standard integer 32-bit format and the FP10) and a 32-bit Z/stencil buffer. Naturally, the sizes will increase if a higher Z-Buffer depth or a higher bit colour depth is used:
640 x 480 = 480p
640 x 240 = 480i
The article mentions 640x480 which is a progressive resolution, but doesn't point that out.
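As a quick way to compare these modes, here is a small sketch of the frame-buffer arithmetic described in the quoted passage: 32-bit colour plus 32-bit Z/stencil per pixel, with no multisampling. The mode list and the helper name `frame_buffer_bytes` are assumptions for illustration, and the half-height entries are per-field sizes only.

```python
MIB = 1024 * 1024


def frame_buffer_bytes(width, height, colour_bytes=4, z_bytes=4):
    """Bytes for one colour buffer plus one Z/stencil buffer, no AA."""
    return width * height * (colour_bytes + z_bytes)


modes = {
    "480i field (640x240)": (640, 240),
    "480p (640x480)": (640, 480),
    "720p (1280x720)": (1280, 720),
    "1080i field (1920x540)": (1920, 540),
}

for name, (w, h) in modes.items():
    size = frame_buffer_bytes(w, h)
    print(f"{name}: {size} bytes = {size / MIB:.2f} MiB")
```

If the buffers are actually kept full-height for interlaced output, the 480i case simply costs the same as 480p.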
Panajev2001a
06-Feb-2006, 08:03
Brimstone, the frame-buffer will be quite likely full-height.
The thought that 480i means half-height frames, or the idea of only rendering fields every 1/60th of a second, has been tried on PSTwo, as the GS's CRTC supports this mode (two 640x224 buffers as front and back/draw buffers).
What happens though is that when the frame-rate dips below 60 fps you notice the screen resolution on the TV monitor drop severely until the game regains 60 fps timing.
Likely XBOX 360, like Dreamcast, like most PSTwo games, etc... uses a full-height back-buffer and z-buffer and downsamples with the CRTC when sending the frame to the TV monitor.
and engineers > all :D
So your numbers are 232+100?
I think that the numbers are 232 + 105 = 337.