Complete Details on Xenos from E3 private showing!

dukmahsik

*Warning: rough Japanese translation...

This is the FULL detail, not tidbits of information from around the internet. This Japanese website went inside a private booth, took pictures, and got every technical detail on the ATI GPU for the Xbox 360 that you guys will ever need. This is the Holy Grail of its specs.

Venue: Los Angeles Convention Center
On May 19, 2005 (North American time), ATI gathered analysts and media from the graphics and semiconductor industries and gave a detailed technical briefing on the GPU for the Xbox 360.

[Image: x360_g02.jpg]


Development concept of the Xbox 360 GPU: a GPU that creates no bottlenecks

As a brief preface, let's first quickly go over the basics.

The Xbox 360 GPU has no official name yet. For the time being it is simply called by the generic name "Xbox 360 GPU", though a formal name, like the PS3's "RSX", will probably be announced eventually. During development, the following were presented as the key concepts:

(1) POWER
(2) Balance
(3) Flexibility
(4) Headroom

Those were the four points. They are abstract and may be hard to grasp, so let's explain them briefly.

[Image: x360_g10.jpg]


(1) POWER means absolute performance: the aim is a design that delivers enough performance not to lose to the competing machine, the PS3.

(2) Balance refers to eliminating bottlenecks. In a game console that runs many different games, the load on the GPU differs from title to title and from scene to scene. To always get maximum performance, graphics processing has to keep flowing, and a bottleneck gets in the way of that. The idea is to have a mechanism that detects the bottleneck and automatically relieves the overloaded element.

(3) Flexibility means programmability and functional variety. Naturally, programmable shaders were the first thing the Xbox 360 GPU aimed at.

(4) Headroom means having performance to spare. Because a game console's architecture is fixed for several years after launch, insufficient performance would constrain the basic design of the games released during that time.

From these, ATI derived the following:

(A) Adaptive Shader Array (programmable shaders that can be dynamically reassigned)
(B) Modeling Engine (special functions offered as an API, implemented in programmable shader software)
(C) Intelligent Memory (special eDRAM with built-in, integrated 3D graphics processing logic)

ATI says these make up an original architecture design not seen in previous ATI RADEON or NVIDIA GeForce parts.

[Image: x360_g11.jpg]


The pipeline of the Xbox 360 GPU: the world's first GPU to adopt a unified shader architecture


The pipeline of the Xbox 360 GPU architecture is shown in the figure.

Looking at it roughly, the first surprise is that the programmable shaders are not divided into vertex shaders and pixel shaders. Instead, there are 48 general-purpose, powerful programmable shader units, arranged as 3 blocks of 16, and a mechanism that allocates them to vertex processing or pixel processing as needed, according to the current load in graphics processing.

This is the signature architecture of the Xbox 360 GPU touched on earlier, "(A) Adaptive Shader Array". From here on, we'll explain each block in turn.

- Command Processor

This block fetches drawing commands (the display list) from memory and decodes them.

- Vertex Grouper

This block extracts vertex data from the drawing commands and plans how it will be processed. Vertex data arriving here needs per-vertex lighting and coordinate transformation in the vertex shader.

Via the Sequencer, it designates which of the 48 (16×3) units in the shader array will act as "programmable vertex shaders". In addition, so that the vertex cache works effectively, it picks up multiple vertices as needed and issues them to the Sequencer as one group. As a result, the number of shader resources in the shader array assigned as programmable vertex shaders is variable.

- Primitive Assembly

This block performs the perspective transformation (producing the scene as seen from the viewpoint) and maps the polygons that make up the scene onto pixels on the screen. It is what is generally known as the "rasterizer" part of a GPU. Its function is completely fixed in logic; this part has no programmability.
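A minimal Python sketch of the fixed-function step described above: a perspective divide followed by a viewport mapping onto pixels. It assumes a pinhole camera looking down -Z with a vertical field of view `fov_y`; the function `project_to_pixel` is hypothetical and only illustrates the math, not the documented Xenos hardware path.

```python
import math


def project_to_pixel(x, y, z, width, height, fov_y):
    """Map a camera-space point (camera looks down -Z) to pixel coordinates."""
    if z >= 0:
        raise ValueError("point must be in front of the camera (z < 0)")
    f = 1.0 / math.tan(fov_y / 2)   # focal length derived from the field of view
    aspect = width / height
    # Perspective divide: the farther away a point is, the closer to centre it lands.
    ndc_x = (f / aspect) * x / -z
    ndc_y = f * y / -z
    # Viewport transform: normalized [-1, 1] -> pixels, with y growing downward.
    px = (ndc_x + 1) * 0.5 * width
    py = (1 - ndc_y) * 0.5 * height
    return px, py


print(project_to_pixel(0.0, 0.0, -5.0, 1280, 720, math.pi / 2))  # (640.0, 360.0)
```

A point straight ahead of the camera lands at the screen centre; moving the same point farther away (more negative z) pulls it toward the centre, which is the perspective effect.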

- Scan Converter

The work, now divided into pixel units, passes through the Sequencer, which designates which of the 48 shaders will act as "programmable pixel shaders".
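One common way to split a triangle into pixel-sized work is the edge-function test, sketched below in Python. The technique is a textbook one chosen for illustration; the article does not say how the Xenos Scan Converter is built, and `edge` and `covered_pixels` are hypothetical helpers.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed-area test: positive if P lies to the left of edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered_pixels(v0, v1, v2, width, height):
    """Yield integer (x, y) pixels whose centres fall inside triangle v0-v1-v2."""
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return  # degenerate triangle: covers nothing
    sign = 1 if area > 0 else -1  # accept either winding order
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    # Only scan the triangle's bounding box, clamped to the screen.
    x_min, x_max = max(int(min(xs)), 0), min(int(max(xs)) + 1, width)
    y_min, y_max = max(int(min(ys)), 0), min(int(max(ys)) + 1, height)
    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            cx, cy = x + 0.5, y + 0.5  # sample at the pixel centre
            w0 = edge(*v1, *v2, cx, cy)
            w1 = edge(*v2, *v0, cx, cy)
            w2 = edge(*v0, *v1, cx, cy)
            if sign * w0 >= 0 and sign * w1 >= 0 and sign * w2 >= 0:
                yield x, y
```

Each yielded pixel would then be handed to whichever shader units the Sequencer has assigned to pixel work. A production rasterizer would also add a tie-breaking (top-left) rule so that pixels lying exactly on an edge shared by two triangles are not drawn twice.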

- Sequencer

At first glance, the Sequencer appears to be the part that controls whether each of the 48 programmable shader resources is allocated as a programmable vertex shader or as a programmable pixel shader.

However, according to Robert Feldstein's explanation, the Sequencer itself behaves like a processor, and the shader programs themselves are executed here. It is drawn as a single box, but it is probably quite complex, large-scale logic. Its instruction set unifies the vertex shader and pixel shader instruction sets, and it reportedly also includes an advanced branch prediction mechanism for program execution.

According to Feldstein, "it implements almost every instruction you can imagine", so in terms of programmability it is thought to be at a level not inferior, and perhaps superior, to the SPEs (Synergistic Processor Elements) of the PS3's CPU, the Cell processor.
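The load-based allocation that the Sequencer performs can be pictured with a minimal Python sketch. It assumes a simple proportional policy; the function name `allocate_alus` and the rule of keeping at least one unit on any stage that has queued work are hypothetical, since the briefing does not describe how Xenos actually arbitrates between vertex and pixel work.

```python
TOTAL_ALUS = 48  # 3 blocks x 16 unified shader units, per the article


def allocate_alus(vertex_load, pixel_load, total=TOTAL_ALUS):
    """Hypothetical policy: split the unified ALUs in proportion to pending work.

    Returns (vertex_alus, pixel_alus).
    """
    if vertex_load < 0 or pixel_load < 0:
        raise ValueError("loads must be non-negative")
    demand = vertex_load + pixel_load
    if demand == 0:
        return 0, 0  # nothing queued: leave every unit idle
    vertex_alus = round(total * vertex_load / demand)
    # Never starve a stage that has work waiting.
    if vertex_load and vertex_alus == 0:
        vertex_alus = 1
    if pixel_load and vertex_alus == total:
        vertex_alus = total - 1
    return vertex_alus, total - vertex_alus


print(allocate_alus(100, 300))  # (12, 36): a pixel-heavy scene
print(allocate_alus(0, 500))    # (0, 48): every unit does pixel work
```

The point of the unified design is visible in the second call: with fixed vertex and pixel units, a pixel-only workload would leave the vertex units idle, while here all 48 units can be put to work.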

- Shader Pipe

There are 3 blocks of 16 basic units, for a total of 48 shader units. Each is essentially an arithmetic unit built around floating-point (FP) vector (SIMD) operations. In a single cycle (clock), it can perform a 4-element vector operation (multiply-add) and a scalar FP operation (1 element) simultaneously. That gives (4 elements × 2 operations + 1 scalar operation) × 48 = 432 flops per cycle, and because the Xbox 360 GPU runs at 500MHz, peak performance from the shaders alone is 432 × 500MHz = 216GFLOPS.
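The peak-throughput arithmetic above can be re-derived in a few lines of Python. The inputs are the figures quoted in the briefing; counting a multiply-add as 2 flops and the scalar operation as 1 flop follows the article's own convention.

```python
ALUS = 48               # 3 blocks x 16 shader units
VECTOR_WIDTH = 4        # 4-element vector operation
FLOPS_PER_MADD = 2      # a multiply-add counts as multiply + add
SCALAR_FLOPS = 1        # the article counts the scalar op as one flop
CLOCK_HZ = 500e6        # 500MHz

flops_per_alu = VECTOR_WIDTH * FLOPS_PER_MADD + SCALAR_FLOPS   # 9
flops_per_cycle = flops_per_alu * ALUS                          # 432
peak_gflops = flops_per_cycle * CLOCK_HZ / 1e9                  # 216.0

print(flops_per_alu, flops_per_cycle, peak_gflops)  # 9 432 216.0
```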

In fact, this shader section is virtualized so that, from the Sequencer's point of view, there appear to be 64 shaders. Shader processing for 64 threads is time-multiplexed onto the 48 physical shaders as needed.
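The time-division idea can be illustrated with a round-robin toy model in Python. It assumes all 64 virtual threads are always ready and that each of the 48 physical units takes one thread per cycle; both assumptions, and the `schedule` function itself, are hypothetical simplifications rather than the actual Xenos scheduling policy.

```python
PHYSICAL_UNITS = 48
VIRTUAL_THREADS = 64


def schedule(cycles, units=PHYSICAL_UNITS, threads=VIRTUAL_THREADS):
    """Return how many issue slots each virtual thread received over `cycles`."""
    issued = [0] * threads
    start = 0  # round-robin pointer into the virtual threads
    for _ in range(cycles):
        for i in range(units):
            issued[(start + i) % threads] += 1
        start = (start + units) % threads  # next cycle resumes where this one stopped
    return issued
```

Over 4 cycles the 48 units provide 192 slots, which works out to exactly 3 per virtual thread, so every thread makes progress even though there are more threads than units.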

- Texture Cache/Texture Pipe

Texture Cache speeds up local texture accesses by retaining textures that have been referenced once (more precisely, a handful of texel data). Texture Pipe is responsible for fetching texel data (the pixels that make up a texture) from texture maps placed in video memory. There are only 4 of them, which may seem few. But a texture read is a memory read, and compared with processor cycles, reading memory takes a long time.

In other words, increasing the number of Texture Pipes indiscriminately is pointless if memory speed is not sufficient. This configuration seems to have been chosen as the best balance against the memory bandwidth, which is 22.4GB/sec on the Xbox 360.
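A minimal LRU model shows why caching texels helps when memory is slow. It assumes texels are fetched in square blocks; the 4×4 block size, 8-block capacity, and the `TextureCache` class are illustrative assumptions, not Xenos parameters. Every miss stands for a trip to video memory.

```python
from collections import OrderedDict


class TextureCache:
    """Hypothetical LRU texel cache that loads whole square blocks on a miss."""

    def __init__(self, capacity_blocks=8, block=4):
        self.capacity = capacity_blocks
        self.block = block
        self.lines = OrderedDict()  # block key -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def fetch(self, u, v):
        """Look up texel (u, v), loading its whole block from memory on a miss."""
        key = (u // self.block, v // self.block)
        if key in self.lines:
            self.lines.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # would be a video-memory read
            self.lines[key] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used block
```

Sampling an 8×8 texel region touches only four 4×4 blocks, so 64 fetches cost just 4 memory reads; neighbouring pixels mostly reuse texels already in the cache.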

- Vertex Cache

A mechanism that speeds up fetching local vertex data. Vertex data that has been used once is kept here.
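A vertex cache can be modeled as a small FIFO of recently processed vertex indices. This Python sketch counts how many vertex-shader runs an index stream needs; the FIFO policy, the 16-entry default, and `vertex_shader_invocations` are assumptions for illustration, not documented Xenos details.

```python
from collections import deque


def vertex_shader_invocations(indices, cache_size=16):
    """Return how many vertices must be shaded for this index stream."""
    cache = deque(maxlen=cache_size)  # the oldest entry drops out automatically
    shaded = 0
    for i in indices:
        if i not in cache:
            shaded += 1        # miss: run the vertex shader for this vertex
            cache.append(i)    # FIFO: hits do not reorder the cache
    return shaded
```

A quad drawn as two triangles, `[0, 1, 2, 2, 1, 3]`, references 6 indices but needs only 4 vertex-shader runs, because vertices 1 and 2 are reused from the cache. This is also why grouping nearby vertices together, as the Vertex Grouper does, helps the cache.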

- Shader Interp.

Short for Shader Interpolator, this unit interpolates the output of the vertex shader into the values passed to each pixel shader.

Ultimately, a polygon is broken down into pixels and proceeds to the rendering pass. In other words, a vertex shader's output inevitably feeds multiple pixel-processing operations. This block creates the values passed to each of those pixel shaders.

In a unified shader architecture like the Xbox 360 GPU, each shader can act as a vertex shader and/or a pixel shader. When a shader works as a vertex shader, its output is temporarily held in this block; when it works as a pixel shader, it picks up results from here.

Furthermore, in anticipation of future shader model versions that increase vertex shader output, this block reportedly has the capacity to handle that.

[Image: x360_g12.jpg]


[Image: x360_g14.jpg]


The 10MB of eDRAM (embedded memory) is one of the defining features of the Xbox 360 GPU. Incidentally, this 10MB eDRAM is manufactured by NEC Electronics. The eDRAM runs at a 2GHz data rate and is connected to the Xbox 360 GPU through a 1,024-bit-wide interface. The bandwidth reaches 2Tbit/sec = 256GB/sec, which is very fast.
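The bandwidth figure follows directly from the bus width and data rate quoted above. This Python check assumes decimal units (10^9 bytes per GB) and 8 bits per byte; note that 1,024 × 2GHz is strictly 2.048Tbit/sec, which the article rounds to 2Tbit.

```python
BUS_WIDTH_BITS = 1024
DATA_RATE_HZ = 2_000_000_000  # 2GHz data rate (transfers per second)

bits_per_second = BUS_WIDTH_BITS * DATA_RATE_HZ     # 2,048,000,000,000
gigabytes_per_second = bits_per_second / 8 / 10**9  # 256.0

print(f"{bits_per_second / 1e12:.3f} Tbit/s = {gigabytes_per_second:.0f} GB/s")
# 2.048 Tbit/s = 256 GB/s
```

The forum replies below debate where this 1,024-bit path actually sits; the arithmetic holds either way, but it only describes the link the 256GB/sec figure refers to.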

The purpose of this oddly sized 10MB of eDRAM was a puzzle, but the latest briefing made clear what it is used for. In short, it can serve as a kind of scratchpad memory; it is not a mere cache. It is used mainly for pixel processing: multiple pixels are gathered here "in one batch" and various pixel operations are performed on them here all at once. So the eDRAM of the Xbox 360 GPU is not just memory; it contains pixel processors, as many as 192 of them.

So what does this high-speed eDRAM with its mass of pixel processors do? Z processing (depth processing), α processing (transparency compositing), and stencil processing (clipping). Anti-aliasing is also done here.

Z processing checks, by comparing depth against what has already been drawn, whether the pixel about to be drawn will be visible: if it is hidden, it is discarded without being drawn; if it is visible, it is drawn. In the end, Z processing is a repetition of reading Z values from the Z buffer and writing them back; in other words, it is memory access. Because Z processing is also used by various shadow-generation techniques, it generates a lot of memory traffic and easily causes performance drops.
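The read-compare-write cycle described above can be shown with a minimal depth buffer in Python. It assumes the common "less-than" comparison, where smaller depth values are closer to the viewer; the `DepthBuffer` class and its counters are illustrative, counting the memory accesses each test costs.

```python
class DepthBuffer:
    """Minimal Z buffer using a less-than test (smaller = closer)."""

    def __init__(self, width, height, far=1.0):
        self.width = width
        self.z = [far] * (width * height)  # cleared to the far plane
        self.reads = 0
        self.writes = 0

    def test_and_write(self, x, y, depth):
        """Return True if the fragment is visible, storing its depth if so."""
        idx = y * self.width + x
        self.reads += 1              # read the stored Z value
        if depth < self.z[idx]:
            self.z[idx] = depth
            self.writes += 1         # write back the new, closer Z value
            return True
        return False                 # hidden: discard without drawing
```

Every fragment costs at least one read and often a write, for every pixel and every pass (including shadow passes), which is why moving this traffic into fast on-chip memory pays off.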

α processing, stencil processing, and anti-aliasing are similar: they are packed with combinations of basic operations that access memory extremely frequently. The idea behind this mechanism was: wouldn't it be effective to process all of these at once, inside this high-speed eDRAM, in parallel on the pixel processors built into it? It is still fresh in memory that at the Xbox press conference, Microsoft's Peter Moore emphasized that "anti-aliasing will be applied to every Xbox 360 game without exception".
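To make the α and stencil operations concrete, here is a Python sketch of the conventional formulas: "source over" alpha blending and an EQUAL stencil test. These are the standard textbook definitions, not confirmed details of what the Xenos eDRAM logic implements; the function names are hypothetical.

```python
def blend_over(src_rgb, src_alpha, dst_rgb):
    """Source-over blending: result = src * a + dst * (1 - a), per channel."""
    return tuple(s * src_alpha + d * (1.0 - src_alpha)
                 for s, d in zip(src_rgb, dst_rgb))


def stencil_pass(stencil_value, ref, mask=0xFF):
    """EQUAL stencil test: draw only where the stored value matches `ref`."""
    return (stencil_value & mask) == (ref & mask)
```

Both need the value already in the framebuffer (the destination colour or stored stencil) before they can produce a result, so, like Z, they are read-modify-write operations on memory.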

According to Feldstein, thanks to this eDRAM mechanism, 4-sample anti-aliasing (commonly known as "4xFSAA") has zero performance cost (or close to it).

This is what the Xbox 360 GPU feature listed earlier as (C) Intelligent Memory does.

Modeling Engine: special graphics functions offered in the form of shader software

The order is reversed, but let's now explain (B) Modeling Engine, the Xbox 360 GPU feature mentioned earlier. It does not appear in the block diagram. In short, it is best thought of as "software". More precisely, from the developer's side, it is something offered as an API.

At the latest briefing, the following were cited as functions offered by the "Modeling Engine":

(i) Global Illumination: general illumination, including interreflection and higher-order reflections
(ii) High Order Surface: higher-order curved surfaces
(iii) Tone Mapping: dynamic range correction after high dynamic range rendering, simulating camera exposure control and pupil contraction
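Tone Mapping can be illustrated with the well-known Reinhard operator, L / (1 + L), plus an automatic exposure that plays the role of the pupil. This is a swapped-in textbook technique, not the Modeling Engine's documented operator; the 0.18 "key" value and the function names are assumptions.

```python
import math


def reinhard(luminance, exposure=1.0):
    """Map HDR luminance in [0, inf) into the displayable range [0, 1)."""
    scaled = luminance * exposure
    return scaled / (1.0 + scaled)


def auto_exposure(luminances, key=0.18, eps=1e-6):
    """Exposure that maps the scene's log-average luminance to `key`.

    Acts like a pupil: bright scenes get less exposure, dark scenes more.
    """
    log_avg = math.exp(sum(math.log(eps + lum) for lum in luminances) / len(luminances))
    return key / log_avg
```

Usage would be `exposure = auto_exposure(frame)` followed by `reinhard(lum, exposure)` for each pixel's luminance. Arbitrarily bright values are compressed below 1.0 instead of clipping.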

In terms of how it is actually implemented, this "Modeling Engine" is a software-based shader library built by exploiting the very high programmability of the Xbox 360 GPU's programmable shaders.

Feldstein emphasized: "The programmable shaders of the Xbox 360 GPU have very high programmability, and furthermore, the data each shader handles is not restricted by graphics semantics such as vertex processing or pixel processing. As a result, processing that until now could only be done on the CPU can now be done on the GPU."

As for Global Illumination, Feldstein describes it as "processing that simulates multiple reflections of light", so it is probably something like ray tracing.

Ray tracing may sound like it has little to do with real-time processing such as 3D game graphics, but that is not the case. Local ray tracing is used, for example, to generate occlusion maps, which are displayed as textures and represent how a 3D model occludes itself. Because this requires intersection tests, it normally had to be done on the CPU, but with the highly programmable shader units of the Xbox 360 GPU it becomes possible.

Occlusion maps are usually precomputed, but if they could be computed at high speed every frame, in real time, for characters that deform, self-shadowing (a character casting shadows onto itself) could be expressed fairly easily (though perhaps that is still too much for this generation?).

Casting light, tracing its path, and baking the result into a texture: this kind of processing can probably be applied to techniques such as Precomputed Radiance Transfer (PRT), which uses spherical harmonics and which Microsoft has started to push in Direct3D.
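The appeal of PRT is that the expensive light tracing happens offline; at run time, diffuse shading reduces to a dot product between a baked per-vertex transfer vector and the light's spherical-harmonic coefficients. This Python sketch shows only that run-time step; the 9-coefficient (bands 0 to 2) basis and the sample values are illustrative assumptions.

```python
def prt_shade(transfer, light):
    """Exit radiance = sum over SH coefficients of transfer[i] * light[i]."""
    if len(transfer) != len(light):
        raise ValueError("transfer and light must use the same SH order")
    return sum(t * l for t, l in zip(transfer, light))


# Hypothetical 9-coefficient vectors: only the constant (band 0) term is set.
transfer = [1.0] + [0.0] * 8
light = [0.5] + [0.0] * 8
print(prt_shade(transfer, light))  # 0.5
```

Because shading is just a dot product, the light can change every frame while the baked self-shadowing and interreflection stay correct for a rigid model.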

High Order Surface is a feature attracting a lot of attention.

Feldstein said: "We implemented a tessellator that can freely subdivide polygons using various functional methods such as splines and Bezier curves."

The tessellator is a feature that was planned for WGF2.0, the graphics subsystem of Longhorn (the next-generation Windows), but was dropped or postponed. On the Xbox 360 GPU it is implemented in software, as part of the Modeling Engine.

A mechanism that automatically increases or decreases the subdivision level according to distance from the viewpoint should also be possible, which makes it well suited to implementing true dynamic Level of Detail (LOD: a mechanism that increases or decreases the number of polygons forming a 3D model according to its distance from the viewpoint).
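Distance-driven subdivision can be sketched in one dimension: a cubic Bezier curve evaluated with De Casteljau's algorithm, with the segment count shrinking as the viewer moves away. The LOD rule (halve the detail each time distance doubles) and its constants are hypothetical, chosen only to illustrate dynamic LOD; a real tessellator works on surface patches rather than curves.

```python
def bezier_point(ctrl, t):
    """De Casteljau evaluation of a Bezier curve at parameter t in [0, 1]."""
    pts = list(ctrl)
    while len(pts) > 1:
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]


def segments_for_distance(distance, max_segments=64, min_segments=2, ref_distance=1.0):
    """Hypothetical LOD rule: detail halves each time the distance doubles."""
    if distance <= ref_distance:
        return max_segments
    return max(min_segments, int(max_segments * ref_distance / distance))


def tessellate(ctrl, distance):
    """Return curve points subdivided for the given viewing distance."""
    n = segments_for_distance(distance)
    return [bezier_point(ctrl, i / n) for i in range(n + 1)]


ctrl = [(0, 0), (1, 2), (3, 2), (4, 0)]
print(len(tessellate(ctrl, 1.0)), len(tessellate(ctrl, 4.0)))  # 65 17
```

The same control points yield 64 segments up close and 16 at four times the distance, so polygon count follows screen size automatically.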

In a sense, the Xbox 360 GPU can be said to anticipate and realize a graphics system that goes even beyond WGF2.0, the next-generation DirectX.

Furthermore, this Modeling Engine is offered as an API in the Xbox 360 development kit and can reportedly be upgraded continuously. New graphics techniques that may appear in the future will be implemented as shader programs as needed and added as new special graphics effects for the Xbox 360.

The PS3's SPEs also hinted at a rather unique future for graphics processing, but the Xbox 360 GPU gives the impression of having SPEs inside the GPU, and in that sense its potential is in no way inferior to the PS3's, and may even be superior.

As canvases for real-time 3D graphics, the Xbox 360 and PS3 each probably hold different possibilities.
 
This eDRAM runs at a 2GHz data rate and is connected to the Xbox 360 GPU through a 1,024-bit-wide interface. The bandwidth reaches 2Tbit/sec = 256GB/sec, which is very fast.


1,024-bit bus? Is this correct!?!?

If so, wouldn't that make it a true 256GB/s rather than "effective"? :oops:
 
So I'm not the only one who would be incredibly surprised to see that, then...

Phew! Thought I had fallen asleep and woken up to the next next gen...
 
DemoCoder said:
Sounds incorrect to me. I can't see a 1024-bit external interconnect @ 2GHz.

Why not? It seems to me like it's in the X360 GPU. This may be why there is less eDRAM. Remember Dave said he was told it was 256GB/s between the R500 and the eDRAM, so I doubt it's a mistake.

BTW, what would the bandwidth be with compression?
 
I thought that the GPU has a parent die with the Shader logic ("R500") and a daughter die ("eDRAM") with logic and eDRAM. I understood from the other discussions that:

1. The connection between the logic on the "eDRAM" and the eDRAM is 256GB/s

2. The connection between the parent die and the daughter die is 32GB/s write and 16GB/s read.

Since there is logic on the eDRAM, is there even a need for 256GB/s of bandwidth between the two? The eDRAM is doing all the frame buffer work and is responsible for the AA (and stencils, alphas, and HDR-like effects, I believe), so it basically just needs to get input and output the final tiles.
 
Is there any detailed explanation on how they accomplish 4xAA practically for free? Four proper samples would surely not be free by means of the "Intelligent Memory", so I'm very curious to know where those samples come from.
 
If ATI can do a 1024-bit external bus running at 2GHz, they don't need eDRAM, LOL :)
Obviously it can't be an external bus...
 
VNZ said:
Is there any detailed explanation on how they accomplish 4xAA practically for free? Four proper samples would surely not be free by means of the "Intelligent Memory", so I'm very curious to know where those samples come from.
Multisampling is your friend. Color samples are just replicated (shaders do not run on 4x pixels); only the Z buffer is supersampled. The eDRAM guarantees all the bandwidth needed to achieve 4x AA without stalling.
 
nAo said:
VNZ said:
Is there any detailed explanation on how they accomplish 4xAA practically for free? Four proper samples would surely not be free by means of the "Intelligent Memory", so I'm very curious to know where those samples come from.
Multisampling is your friend. Color samples are just replicated (shaders do not run on 4x pixels); only the Z buffer is supersampled. The eDRAM guarantees all the bandwidth needed to achieve 4x AA without stalling.
cool .. didn't know that about MSAA .. thanx nAo :)
 
I'd probably get some sleep now, and hopefully find some elaboration on this tomorrow... I don't see how just replicating color and supersampling z can amount to an antialiasing effect. Or what use the supersampled z have for that matter. :?
 
VNZ said:
I'd probably get some sleep now, and hopefully find some elaboration on this tomorrow... I don't see how just replicating color and supersampling z can amount to an antialiasing effect. Or what use the supersampled z have for that matter. :?

It's one color per fragment, not per pixel.
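The "one colour per fragment, depth per sample" idea can be sketched in Python. The pixel shader runs once; its colour is copied into each of the 4 samples that the triangle covers and that pass a per-sample depth test; the displayed pixel is the average of its samples (the "resolve"). Sample coverage is passed in directly here rather than computed from geometry, and the function names are hypothetical.

```python
SAMPLES = 4


def shade_pixel(samples, coverage, fragment_depths, shaded_color):
    """Update one pixel's (colour, depth) samples in place; shading ran once."""
    for i in range(SAMPLES):
        _, depth = samples[i]
        if coverage[i] and fragment_depths[i] < depth:        # per-sample Z test
            samples[i] = (shaded_color, fragment_depths[i])   # replicate the colour


def resolve(samples):
    """Average the sample colours into the final displayed pixel."""
    return tuple(sum(color[k] for color, _ in samples) / SAMPLES for k in range(3))


# A white triangle edge covering 2 of the 4 samples over a black background.
pixel = [((0.0, 0.0, 0.0), 1.0) for _ in range(SAMPLES)]
shade_pixel(pixel, [True, True, False, False], [0.5] * SAMPLES, (1.0, 1.0, 1.0))
print(resolve(pixel))  # (0.5, 0.5, 0.5)
```

The edge pixel comes out grey, which is the anti-aliasing effect, yet the shader ran only once. The extra cost lies in storing and testing 4 depth and colour samples per pixel, which is the memory traffic the eDRAM is meant to absorb.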
 
AaronSpink mentioned in a previous thread that with MCM packaging, it wouldn't be particularly difficult to implement a 1024 bit bus between the 2 dies. Maybe they actually did it.
 
handbrake2 said:
AaronSpink mentioned in a previous thread that with MCM packaging, it wouldn't be particularly difficult to implement a 1024 bit bus between the 2 dies. Maybe they actually did it.

It wouldn't help performance unless they also increased the fillrate.
 
nAo said:
VNZ said:
Is there any detailed explanation on how they accomplish 4xAA practically for free? Four proper samples would surely not be free by means of the "Intelligent Memory", so I'm very curious to know where those samples come from.
Multisampling is your friend. Color samples are just replicated (shaders do not run on 4x pixels); only the Z buffer is supersampled. The eDRAM guarantees all the bandwidth needed to achieve 4x AA without stalling.

These diagrams are nice to look at:

http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=13#aa

Green represents the colour (texture) sample and red (or blue) the z samples (geometry - i.e. depth of the triangle that covers each sample position).

Jawed
 
That eDRAM stuff sounds very nice. Assuming the Revolution/PS3 arrive sometime between mid-2k6 and late 2k6, and neither was designed with such a feature in mind, is it too late to incorporate such a thing? (My guess is that it'd be very, very tough to do cost/time-wise, maybe with a late 2k6 launch, which'd be viable if MS faces production issues/shortages. Even though the PS3 real-time demos appeared to have virtually flawless IQ, I'd just like some gooey-delicious free AA... hmm, couldn't a modded/tweaked GS/EE chip with added/increased eDRAM perform a similar function?)
 
PS3 is not going to have eDRAM.

The Hollywood GPU in Revolution may. Flipper (the GPU in the GCN made by ArtX, now owned by ATI) had eDRAM so it looks likely.

eDRAM is nice and all, but you are talking about either putting it on die, which means trading off logic and/or yields, or putting it on a separate package and dealing with the headache. And even 10MB has some issues that require workarounds.

In theory it is great, but getting it to work correctly is tricky and expensive.
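A back-of-the-envelope Python check shows one of those 10MB issues. It assumes 4 bytes of colour plus 4 bytes of depth/stencil per sample (a common format; others change the numbers) and treats 10MB as 10 × 1024 × 1024 bytes. A 1280×720 frame with 4x multisampling does not fit, so the frame would have to be rendered in pieces; splitting it into screen tiles is one such workaround.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # assumed: 10MB = 10 MiB


def framebuffer_bytes(width, height, samples, bytes_per_sample=8):
    """Colour + depth/stencil storage for every sample of every pixel."""
    return width * height * samples * bytes_per_sample


def min_tiles(width, height, samples):
    """Smallest number of screen tiles whose framebuffers each fit in eDRAM."""
    return math.ceil(framebuffer_bytes(width, height, samples) / EDRAM_BYTES)


print(framebuffer_bytes(1280, 720, 4))  # 29491200 bytes, about 28 MiB
print(min_tiles(1280, 720, 4))          # 3
```

Without AA the same 720p frame needs about 7MB and fits in one pass; 2x AA already needs 2 tiles.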
 